Transactions involving multiple postgres foreign servers

Started by Ashutosh Bapat about 11 years ago, 193 messages

ashutosh.bapat@enterprisedb.com

about 11 years ago

1 attachment(s)

Hi All,
While looking at the patch for supporting inheritance on foreign tables, I
noticed that if a transaction makes changes to two or more foreign
servers, the current implementation in postgres_fdw doesn't make sure that
either all of them roll back or all of them commit their changes. IOW, there
is a possibility that some of them commit their changes while others
roll back theirs.

PFA a patch which uses 2PC to solve this problem. In pgfdw_xact_callback(), at
the XACT_EVENT_PRE_COMMIT event, it prepares the transaction on all the
foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or XACT_EVENT_ABORT
event it commits or aborts those transactions, respectively.
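
For illustration, the prepare-then-commit flow described above can be modelled
in a few lines of self-contained C. This is only a sketch of the protocol, not
code from the patch: the Participant type, its states, and the
prepare_will_fail flag are hypothetical stand-ins for a foreign server and for
an error such as a deferred constraint violation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef enum
{
    PART_ACTIVE,
    PART_PREPARED,
    PART_COMMITTED,
    PART_ABORTED
} PartState;

typedef struct Participant
{
    PartState   state;
    bool        prepare_will_fail;  /* simulates an error raised at prepare time */
} Participant;

/* Phase 1: models PREPARE TRANSACTION; a failure aborts that participant. */
static bool
participant_prepare(Participant *p)
{
    assert(p->state == PART_ACTIVE);
    if (p->prepare_will_fail)
    {
        p->state = PART_ABORTED;
        return false;
    }
    p->state = PART_PREPARED;
    return true;
}

/* Returns true if every participant committed, false if all were rolled back. */
static bool
two_phase_commit(Participant *parts, size_t n)
{
    size_t      i;
    bool        all_prepared = true;

    /* Phase 1 (pre-commit): prepare everywhere, stop at the first failure. */
    for (i = 0; i < n; i++)
    {
        if (!participant_prepare(&parts[i]))
        {
            all_prepared = false;
            break;
        }
    }

    /* Phase 2: models COMMIT PREPARED everywhere, or ROLLBACK PREPARED / ABORT. */
    for (i = 0; i < n; i++)
        parts[i].state = all_prepared ? PART_COMMITTED : PART_ABORTED;

    return all_prepared;
}
```

The model assumes phase 2 itself cannot fail; a lost connection between the two
phases is exactly the case listed under Limitations below.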

The logic to craft the prepared transaction ids is rudimentary and I am
open to suggestions for the same. I have the following goals in mind while
crafting the transaction ids:
1. Minimize the chances of crafting a transaction id which would conflict
with a concurrent transaction id on that foreign server.
2. Because of a limitation described later, the DBA/user should be able to
identify the server which originated a remote transaction.
More can be found in the comments above the function pgfdw_get_prep_xact_id() in
the patch.
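
As one possible direction for these two goals, here is a minimal sketch of an
id builder in self-contained C. It is not the function in the patch: the name
build_prep_xact_id, the origin prefix, and the use of a local transaction id
in place of random() are all assumptions made for illustration. The only fact
taken from PostgreSQL is that a prepared transaction identifier must be less
than 200 bytes long.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* PostgreSQL requires prepared transaction ids to be less than 200 bytes. */
#define PREP_XACT_ID_MAX_LEN 200

/*
 * Hypothetical helper: writes "px_<origin>_<local_xid>_<serverid>_<userid>"
 * into buf.  "origin" is an assumed, DBA-chosen label for the originating
 * server (goal 2); the local xid is assumed to make the id unlikely to
 * collide with concurrent ids on the foreign server (goal 1).
 *
 * Returns 0 on success, or -1 if the id did not fit and was truncated.
 */
static int
build_prep_xact_id(char *buf, size_t buflen, const char *origin,
                   unsigned int local_xid, unsigned int serverid,
                   unsigned int userid)
{
    int         n;

    assert(buf != NULL && origin != NULL);

    n = snprintf(buf, buflen, "px_%s_%u_%u_%u",
                 origin, local_xid, serverid, userid);
    if (n < 0 || (size_t) n >= buflen)
        return -1;              /* refuse to hand out a truncated id */
    return 0;
}
```

A caller would pass a buffer of PREP_XACT_ID_MAX_LEN bytes and treat -1 as an
error instead of sending a truncated, possibly ambiguous id to the remote side.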

Limitations
---------------
1. After a transaction has been prepared on a foreign server, if the
connection to that server is lost before the transaction is rolled back or
committed on that server, the transaction remains in the prepared state
forever. Manual intervention would be needed to clean up such a transaction
(hence goal 2 above). Automating this process would require significant
changes to the transaction manager, so I have left it out of this patch, which I
thought would be better for now. If required, I can work on that part in
this patch itself.

2. 2PC is needed only when two or more foreign servers are involved
in a transaction. Transactions on a single foreign server are handled well
right now. So, ideally, the code should detect whether two or more
foreign servers are involved in the transaction and only then use 2PC. But I
couldn't find a way to detect that without changing the transaction manager.
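
The check itself is simple once the per-connection state is available; the
open question is where that state can be consulted reliably. The sketch below
only shows the counting logic in self-contained C. ConnState is a simplified,
hypothetical mirror of the connection cache entry, and the code does not
distinguish read-only remote transactions from ones that wrote data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical mirror of the per-connection state in the cache. */
typedef struct ConnState
{
    bool        connected;      /* models entry->conn != NULL */
    int         xact_depth;     /* 0 = no remote transaction open */
} ConnState;

/* Count the connections that have a remote transaction open. */
static size_t
count_open_remote_xacts(const ConnState *conns, size_t n)
{
    size_t      i;
    size_t      open = 0;

    assert(conns != NULL || n == 0);

    for (i = 0; i < n; i++)
    {
        if (conns[i].connected && conns[i].xact_depth > 0)
            open++;
    }
    return open;
}

/* 2PC is only worth its cost when two or more foreign servers take part. */
static bool
needs_two_phase(const ConnState *conns, size_t n)
{
    return count_open_remote_xacts(conns, n) >= 2;
}
```

With a single open remote transaction, needs_two_phase() returns false and the
existing plain COMMIT path could be kept.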
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 116be7d..9492f14 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -42,20 +42,21 @@ typedef struct ConnCacheKey
 } ConnCacheKey;
 
 typedef struct ConnCacheEntry
 {
 	ConnCacheKey key;			/* hash key (must be first) */
 	PGconn	   *conn;			/* connection to foreign server, or NULL */
 	int			xact_depth;		/* 0 = no xact open, 1 = main xact open, 2 =
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	char		*prep_xact_name;	/* Name of prepared transaction on this connection */
 } ConnCacheEntry;
 
 /*
  * Connection cache (initialized on first use)
  */
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
@@ -135,20 +136,21 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
 		/* initialize new hashtable entry (key is already filled in) */
 		entry->conn = NULL;
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->prep_xact_name = NULL;
 	}
 
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
 	 * connection is actually used.
 	 */
 
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
@@ -156,20 +158,21 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
 		entry->conn = connect_pg_server(server, user);
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
+		entry->prep_xact_name = NULL;
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
 	begin_remote_xact(entry);
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
@@ -507,20 +510,55 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
+ * pgfdw_get_prep_xact_id
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * The name should give enough hints to the DBA/user, in case a manual
+ * intervention is required as per the comment in its caller.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is part
+ * of the extension not the core.
+ *
+ * In order to make it easy for the DBA/user to identify the originating server
+ * (so that he can verify the status of the originating transaction), we should
+ * let the DBA/user configure a prefix to be used for prepared transaction ids,
+ * but that requires changes to the core, so left out of the work. Instead we
+ * use the serverid and userid, which help in creating unique ids as well give
+ * hints (albeit weak) to the originating transaction.
+ */
+static char *
+pgfdw_get_prep_xact_id(ConnCacheEntry *entry)
+{
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_xact_name = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+
+	snprintf(prep_xact_name, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								entry->key.serverid, entry->key.userid);
+	return prep_xact_name;
+}
+/*
  * pgfdw_xact_callback --- cleanup at main-transaction end.
  */
 static void
 pgfdw_xact_callback(XactEvent event, void *arg)
 {
 	HASH_SEQ_STATUS scan;
 	ConnCacheEntry *entry;
 
 	/* Quick exit if no connections were touched in this transaction. */
 	if (!xact_got_connection)
@@ -540,24 +578,48 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 			continue;
 
 		/* If it has an open remote transaction, try to close it */
 		if (entry->xact_depth > 0)
 		{
 			elog(DEBUG3, "closing remote transaction on connection %p",
 				 entry->conn);
 
 			switch (event)
 			{
+				char		*prep_xact_name;
+				StringInfo	command;
 				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
+					/* 
+					 * Prepare the transaction on the remote nodes, to check the
+					 * subsequent COMMIT would go through or not.
+					 * TODO:
+					 * We also don't need prepared transactions involving a
+					 * single foreign server. How do we know that the current
+					 * transaction involved only a single foreign server?
+					 */
 
+					prep_xact_name = pgfdw_get_prep_xact_id(entry);
+					command = makeStringInfo();
+					appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_xact_name);
+					do_sql_command(entry->conn, command->data);
+					/* The transaction got prepared, register this fact */
+					entry->prep_xact_name = prep_xact_name;
+					/*
+					 * XXX:
+					 * After this, if the server crashes or loses connection to
+					 * the foreign server, before COMMIT/ABORT message is sent
+					 * to the foreign server, the transaction prepared above would
+					 * remain in that state till DBA/user manually commits or
+					 * rolls it back. The automation of this step would require
+					 * changes in the core transaction manager and left out for
+					 * now.
+					 */
 					/*
 					 * If there were any errors in subtransactions, and we
 					 * made prepared statements, do a DEALLOCATE ALL to make
 					 * sure we get rid of all prepared statements. This is
 					 * annoying and not terribly bulletproof, but it's
 					 * probably not worth trying harder.
 					 *
 					 * DEALLOCATE ALL only exists in 8.3 and later, so this
 					 * constrains how old a server postgres_fdw can
 					 * communicate with.  We intentionally ignore errors in
@@ -568,86 +630,115 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 					 */
 					if (entry->have_prep_stmt && entry->have_error)
 					{
 						res = PQexec(entry->conn, "DEALLOCATE ALL");
 						PQclear(res);
 					}
 					entry->have_prep_stmt = false;
 					entry->have_error = false;
 					break;
 				case XACT_EVENT_PRE_PREPARE:
-
 					/*
 					 * We disallow remote transactions that modified anything,
 					 * since it's not very reasonable to hold them open until
 					 * the prepared transaction is committed.  For the moment,
 					 * throw error unconditionally; later we might allow
 					 * read-only cases.  Note that the error will cause us to
 					 * come right back here with event == XACT_EVENT_ABORT, so
 					 * we'll clean up the connection state at that point.
 					 */
 					ereport(ERROR,
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot prepare a transaction that modified remote tables")));
 					break;
 				case XACT_EVENT_COMMIT:
+					/*
+					 * If we prepared a transaction on a foreign server, commit
+					 * it.
+					 */
+					if (entry->prep_xact_name)
+					{
+						command = makeStringInfo();
+						appendStringInfo(command, "COMMIT PREPARED '%s'",
+													entry->prep_xact_name);
+						do_sql_command(entry->conn, command->data);
+						entry->prep_xact_name = NULL;
+					}
+					else
+						/* Pre-commit should have closed the open transaction */
+						elog(ERROR, "missed cleaning up connection during pre-commit");
+
+					break;
 				case XACT_EVENT_PREPARE:
 					/* Pre-commit should have closed the open transaction */
 					elog(ERROR, "missed cleaning up connection during pre-commit");
 					break;
 				case XACT_EVENT_ABORT:
 					/* Assume we might have lost track of prepared statements */
 					entry->have_error = true;
 					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
+					if (entry->prep_xact_name)
+					{
+						command = makeStringInfo();
+						appendStringInfo(command, "ROLLBACK PREPARED '%s'", entry->prep_xact_name);
+						res = PQexec(entry->conn, command->data);
+						pfree(entry->prep_xact_name);
+						entry->prep_xact_name = NULL;
+					}
+					else
+						res = PQexec(entry->conn, "ABORT TRANSACTION");
 					/* Note: can't throw ERROR, it would be infinite loop */
 					if (PQresultStatus(res) != PGRES_COMMAND_OK)
 						pgfdw_report_error(WARNING, res, entry->conn, true,
 										   "ABORT TRANSACTION");
 					else
 					{
 						PQclear(res);
 						/* As above, make sure to clear any prepared stmts */
 						if (entry->have_prep_stmt && entry->have_error)
 						{
 							res = PQexec(entry->conn, "DEALLOCATE ALL");
 							PQclear(res);
 						}
 						entry->have_prep_stmt = false;
 						entry->have_error = false;
 					}
 					break;
 			}
 		}
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		/*
+		 * If we have aborted or committed the transaction, reset state to show
+		 * we're out of a transaction.
+		 */
+		if (event != XACT_EVENT_PRE_COMMIT)  
+			entry->xact_depth = 0;
 
 		/*
 		 * If the connection isn't in a good idle state, discard it to
 		 * recover. Next GetConnection will open a new connection.
 		 */
 		if (PQstatus(entry->conn) != CONNECTION_OK ||
 			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
 		{
 			elog(DEBUG3, "discarding connection %p", entry->conn);
 			PQfinish(entry->conn);
 			entry->conn = NULL;
 		}
 	}
 
 	/*
-	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * If we have aborted or committed the transaction, we can now mark ourselves
+	 * as out of the transaction.
 	 */
-	xact_got_connection = false;
+	if (event != XACT_EVENT_PRE_COMMIT)  
+		xact_got_connection = false;
 
 	/* Also reset cursor numbering for next transaction */
 	cursor_number = 0;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
 static void
 pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index f7e11ed..e7e9bf7 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3063,10 +3063,122 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Next test shows local transaction fails if "any" of the remote transactions
+-- fail to commit. But any COMMITted transaction on the remote servers remains
+-- COMMITTED.
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ae96684..f15d302 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -665,10 +665,95 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+-- Next test shows local transaction fails if "any" of the remote transactions
+-- fail to commit. But any COMMITted transaction on the remote servers remains
+-- COMMITTED.
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default

Tom Lane

tgl@sss.pgh.pa.us

about 11 years ago

In reply to: Ashutosh Bapat (#1)

Re: Transactions involving multiple postgres foreign servers

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:

While looking at the patch for supporting inheritance on foreign tables, I
noticed that if a transaction makes changes to two or more foreign
servers, the current implementation in postgres_fdw doesn't make sure that
either all of them roll back or all of them commit their changes. IOW, there
is a possibility that some of them commit their changes while others
roll back theirs.

PFA a patch which uses 2PC to solve this problem. In pgfdw_xact_callback(), at
the XACT_EVENT_PRE_COMMIT event, it prepares the transaction on all the
foreign PostgreSQL servers, and at the XACT_EVENT_COMMIT or XACT_EVENT_ABORT
event it commits or aborts those transactions, respectively.

TBH, I think this is a pretty awful idea.

In the first place, this does little to improve the actual reliability
of a commit occurring across multiple foreign servers; and in the second
place it creates a bunch of brand new failure modes, many of which would
require manual DBA cleanup.

The core of the problem is that this doesn't have anything to do with
2PC as it's commonly understood: for that, you need a genuine external
transaction manager that is aware of all the servers involved in a
transaction, and has its own persistent state (or at least a way to
reconstruct its own state by examining the per-server states).
This patch is not that; in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.

As far as failure modes go, one basic reason why this cannot work as
presented is that the remote servers may not even have prepared
transaction support enabled (in fact max_prepared_transactions = 0
is the default in all supported PG versions). So this would absolutely
have to be a not-on-by-default option. But the bigger issue is that
leaving it to the DBA to clean up after failures is not a production
grade solution, *especially* not for prepared transactions, which are
performance killers if not closed out promptly. So I can't imagine
anyone wanting to turn this on without a more robust answer than that.

Basically I think what you'd need for this to be a credible patch would be
for it to work by changing the behavior only in the PREPARE TRANSACTION
path: rather than punting as we do now, prepare the remote transactions,
and report their server identities and gids to an external transaction
manager, which would then be responsible for issuing the actual commits
(along with the actual commit of the local transaction). I have no idea
whether it's feasible to do that without having to assume a particular
2PC transaction manager API/implementation.
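
As a rough illustration of the reporting step described above, the following
self-contained C sketch hands each prepared remote transaction's server
identity and gid to a transaction manager through a callback. Nothing here is
an existing PostgreSQL or transaction manager API: RemotePrepared,
tm_report_fn, report_prepared, and the in-memory MemoryTM stub are all
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define GID_MAX_LEN 200         /* prepared transaction ids are < 200 bytes */
#define MEMORY_TM_SLOTS 8

/* One remote transaction that has been prepared and must be resolved later. */
typedef struct RemotePrepared
{
    const char *server_name;    /* identity of the foreign server */
    char        gid[GID_MAX_LEN];   /* id used in PREPARE TRANSACTION */
} RemotePrepared;

/* Hypothetical hook supplied by the external TM; returns 0 once recorded. */
typedef int (*tm_report_fn) (const RemotePrepared *rp, void *tm_state);

/*
 * Report every prepared remote transaction to the TM.  Returns the number
 * reported; a value smaller than n means the TM refused one, so the caller
 * should fail the local PREPARE TRANSACTION rather than continue.
 */
static size_t
report_prepared(const RemotePrepared *list, size_t n,
                tm_report_fn report, void *tm_state)
{
    size_t      i;

    assert(report != NULL);

    for (i = 0; i < n; i++)
    {
        if (report(&list[i], tm_state) != 0)
            break;
    }
    return i;
}

/* Toy TM for demonstration: remembers reports in memory, not durably. */
typedef struct MemoryTM
{
    size_t      capacity;       /* must not exceed MEMORY_TM_SLOTS */
    size_t      count;
    const RemotePrepared *seen[MEMORY_TM_SLOTS];
} MemoryTM;

static int
memory_tm_report(const RemotePrepared *rp, void *tm_state)
{
    MemoryTM   *tm = tm_state;

    if (tm->count >= tm->capacity)
        return -1;
    tm->seen[tm->count++] = rp;
    return 0;
}
```

A real transaction manager would need persistent state, as noted earlier in
this message; the in-memory stub exists only to make the sketch testable.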

It'd be interesting to hear from people who are using 2PC in production
to find out if this would solve any real-world problems for them, and
what the details of the TM interface would need to look like to make it
work in practice.

In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Robert Haas

robertmhaas@gmail.com

about 11 years ago

In reply to: Tom Lane (#2)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jan 2, 2015 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.

I was involved in some internal discussions related to this patch, so
I have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just
as atomic, consistent, isolated, and durable as they would be with
just one server. As far as I know, there is no way to achieve this
goal through the use of an external transaction manager, because even
if that external transaction manager guarantees, for every
transaction, that the transaction either commits on all nodes or rolls
back on all nodes, there's no way for it to guarantee that other
transactions won't see some intermediate state where the commit has
been completed on some nodes but not others. To get that, you need
some sort of integration that reaches down to the way snapshots are taken.

I think, though, that it might be worthwhile to first solve the
simpler problem of figuring out how to ensure that a transaction
commits everywhere or rolls back everywhere, even if intermediate
states might still be transiently visible. I don't think this patch,
as currently designed, is equal to that challenge, because
XACT_EVENT_PRE_COMMIT fires before the transaction is certain to
commit - PreCommit_CheckForSerializationFailure or PreCommit_Notify
could still error out. We could have a hook that fires after that,
but that doesn't solve the problem if a user of that hook can itself
throw an error. Even if part of the API contract is that it's not
allowed to do so, the actual attempt to commit the change on the
remote side can fail due to - e.g. - a network interruption, and
that's got to be dealt with somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tom Lane

tgl@sss.pgh.pa.us

about 11 years ago

In reply to: Robert Haas (#3)

Re: Transactions involving multiple postgres foreign servers

Robert Haas <robertmhaas@gmail.com> writes:

I was involved in some internal discussions related to this patch, so
I have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just
as atomic, consistent, isolated, and durable as they would be with
just one server. As far as I know, there is no way to achieve this
goal through the use of an external transaction manager, because even
if that external transaction manager guarantees, for every
transaction, that the transaction either commits on all nodes or rolls
back on all nodes, there's no way for it to guarantee that other
transactions won't see some intermediate state where the commit has
been completed on some nodes but not others. To get that, you need
some sort of integration that reaches down to the way snapshots are taken.

That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there. Certainly the proposed patch doesn't
look like it moves us very far towards that set of goalposts.

I think, though, that it might be worthwhile to first solve the
simpler problem of figuring out how to ensure that a transaction
commits everywhere or rolls back everywhere, even if intermediate
states might still be transiently visible.

Perhaps. I suspect that it might still be a dead end if the ultimate
goal is cross-system atomic commit ... but likely it would teach us
some useful things anyway.

regards, tom lane

Robert Haas

robertmhaas@gmail.com

about 11 years ago

In reply to: Tom Lane (#4)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jan 5, 2015 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there.

Why?

It would be surprising to me if, given that we have gone to some pains
to create a system that allows cross-system queries, and hopefully
eventually pushdown of quals, joins, and aggregates, we then made
sharding work in some completely different way that reuses none of
that infrastructure. But maybe I am looking at this the wrong way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Tom Lane

tgl@sss.pgh.pa.us

about 11 years ago

In reply to: Robert Haas (#5)

Re: Transactions involving multiple postgres foreign servers

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Jan 5, 2015 at 2:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

That's a laudable goal, but I would bet that nothing built on the FDW
infrastructure will ever get there.

Why?

It would be surprising to me if, given that we have gone to some pains
to create a system that allows cross-system queries, and hopefully
eventually pushdown of quals, joins, and aggregates, we then made
sharding work in some completely different way that reuses none of
that infrastructure. But maybe I am looking at this the wrong way.

Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.

I think the idea I sketched upthread of supporting an external transaction
manager might be worth pursuing, in that it would potentially lead to
having at least an approximation of atomic commit across heterogeneous
servers.

Independently of that, I think what you are talking about would be better
addressed outside the constraints of the FDW mechanism. That's not to say
that we couldn't possibly make postgres_fdw use some additional non-FDW
infrastructure to manage commits; just that solving this in terms of the
FDW infrastructure seems wrongheaded to me.

regards, tom lane

#7

Robert Haas

robertmhaas@gmail.com

about 11 years ago

In reply to: Tom Lane (#6)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jan 5, 2015 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.

Atomic commit is something that can potentially be supported by many
different FDWs, as long as the thing on the other end supports 2PC.
If you're talking to Oracle or DB2 or SQL Server, and it supports 2PC,
then you can PREPARE the transaction and then go back and COMMIT the
transaction once it's committed locally. Getting a cluster-wide
*snapshot* is probably a PostgreSQL-only thing requiring much deeper
integration, but I think it would be sensible to leave that as a
future project and solve the simpler problem first.

I think the idea I sketched upthread of supporting an external transaction
manager might be worth pursuing, in that it would potentially lead to
having at least an approximation of atomic commit across heterogeneous
servers.

An important threshold question here is whether we want to rely on an
external transaction manager, or build one into PostgreSQL. As far as
this particular project goes, there's nothing that can't be done
inside PostgreSQL. You need a durable registry of which transactions
you prepared on which servers, and which XIDs they correlate to. If
you have that, then you can use background workers or similar to go
retry commits or rollbacks of prepared transactions until it works,
even if there's been a local crash meanwhile.
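
A minimal sketch of that retry loop, written in Python for brevity rather than as backend code. The registry layout, the name resolve_pending, and the send and xid_committed callables are all hypothetical; a real worker would read the registry from durable storage and pause between rounds.

```python
def resolve_pending(registry, xid_committed, send, max_rounds=5):
    """Retry COMMIT/ROLLBACK PREPARED for every registry entry.

    registry      -- list of dicts with keys "server", "gid", "xid"
                     (hypothetical layout, not from the patch)
    xid_committed -- callable: did the local transaction with this XID commit?
    send          -- callable(server, sql); raises ConnectionError if unreachable
    Returns the entries still unresolved after max_rounds.
    """
    pending = list(registry)
    for _ in range(max_rounds):
        if not pending:
            break
        unresolved = []
        for entry in pending:
            # The fate of the remote transaction follows the local XID.
            verb = "COMMIT" if xid_committed(entry["xid"]) else "ROLLBACK"
            sql = "%s PREPARED '%s'" % (verb, entry["gid"])
            try:
                send(entry["server"], sql)
            except ConnectionError:
                unresolved.append(entry)  # try again in the next round
        pending = unresolved
    return pending
```

Entries that come back from the function are the ones a DBA would still have to look at if the worker gave up.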

Alternatively, you could rely on an external transaction manager to do
all that stuff. I don't have a clear sense of what that would entail,
or how it might be better or worse than rolling our own. I suspect,
though, that it might amount to little more than adding a middle man.
I mean, a third-party transaction manager isn't going to automatically
know how to commit a transaction prepared on some foreign server using
some foreign data wrapper. It's going to have to be taught that if
postgres_fdw leaves a transaction in-medias-res on server OID 1234,
you've got to connect to the target machine using that foreign
server's connection parameters, speak libpq, and issue the appropriate
COMMIT PREPARED command. And similarly, you're going to need to
arrange to notify it before preparing that transaction so that it
knows that it needs to request the COMMIT or ABORT later on. Once
you've got all of that infrastructure for that in place, what are you
really gaining over just doing it in PostgreSQL (or, say, a contrib
module thereto)?

(I'm also concerned that an external transaction manager might need
the PostgreSQL client to be aware of it, whereas what we'd really like
here is for the client to just speak PostgreSQL and be happy that its
commits no longer end up half-done.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 11 years ago

In reply to: Robert Haas (#3)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jan 5, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 2, 2015 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.

I was involved in some internal discussions related to this patch, so
I have some opinions on it. The long-term, high-level goal here is to
facilitate sharding. If we've got a bunch of PostgreSQL servers
interlinked via postgres_fdw, it should be possible to perform
transactions on the cluster in such a way that transactions are just
as atomic, consistent, isolated, and durable as they would be with
just one server. As far as I know, there is no way to achieve this
goal through the use of an external transaction manager, because even
if that external transaction manager guarantees, for every
transaction, that the transaction either commits on all nodes or rolls
back on all nodes, there's no way for it to guarantee that other
transactions won't see some intermediate state where the commit has
been completed on some nodes but not others. To get that, you need
some sort of integration that reaches down to the way snapshots are taken.

I think, though, that it might be worthwhile to first solve the
simpler problem of figuring out how to ensure that a transaction
commits everywhere or rolls back everywhere, even if intermediate
states might still be transiently visible.

Agreed.

I don't think this patch,
as currently designed, is equal to that challenge, because
XACT_EVENT_PRE_COMMIT fires before the transaction is certain to
commit - PreCommit_CheckForSerializationFailure or PreCommit_Notify
could still error out. We could have a hook that fires after that,
but that doesn't solve the problem if a user of that hook can itself
throw an error. Even if part of the API contract is that it's not
allowed to do so, the actual attempt to commit the change on the
remote side can fail due to - e.g. - a network interruption, and
that's got to be dealt with somehow.

Tom mentioned
--
in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.
--
You have given a specific example of this case. So, let me dry-run
CommitTransaction() with my patch applied.

    CallXactCallbacks(XACT_EVENT_PRE_COMMIT);

While processing this event, postgres_fdw's callback pgfdw_xact_callback()
sends PREPARE TRANSACTION to all the foreign servers involved. Each server
reports success or failure. If even one of them fails, the local
transaction is aborted along with all the prepared transactions. Only if
all the foreign servers succeed do we proceed further.

    PreCommit_CheckForSerializationFailure();

    PreCommit_Notify();

If either of these functions (as you mentioned above) throws an error, the
local transaction will be aborted, as will the remote prepared
transactions. Note that we have not yet committed the local transaction
(that is done below), nor the remote transactions, which are in PREPAREd
state. Since all the transactions, local as well as remote, are aborted in
case of an error, the data is still consistent. If these steps succeed, we
proceed.

    /* Prevent cancel/die interrupt while cleaning up */
    HOLD_INTERRUPTS();

    /* Commit updates to the relation map --- do this as late as possible */
    AtEOXact_RelationMap(true);

    /*
     * set the current transaction state information appropriately during
     * commit processing
     */
    s->state = TRANS_COMMIT;

    /*
     * Here is where we really truly commit.
     */
    latestXid = RecordTransactionCommit();

    TRACE_POSTGRESQL_TRANSACTION_COMMIT(MyProc->lxid);

    /*
     * Let others know about no transaction in progress by me. Note that this
     * must be done _before_ releasing locks we hold and _after_
     * RecordTransactionCommit.
     */
    ProcArrayEndTransaction(MyProc, latestXid);

At this point the local transaction is committed and the remote
transactions are still in PREPAREd state. If any server (including the
local one) crashes or a link fails here, the remote transactions are left
dangling in PREPAREd state and manual cleanup will be required.

    CallXactCallbacks(XACT_EVENT_COMMIT);

The postgres_fdw callback pgfdw_xact_callback() commits the PREPAREd
transactions by sending COMMIT PREPARED to the remote servers (my patch).
So, I don't see why my patch would cause inconsistencies. It can cause
dangling PREPAREd transactions, and I have already acknowledged that fact.

Am I missing something?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#9

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 11 years ago

In reply to: Tom Lane (#2)

Re: Transactions involving multiple postgres foreign servers

On Sat, Jan 3, 2015 at 2:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:

While looking at the patch for supporting inheritance on foreign tables, I
noticed that if a transaction makes changes to more than two foreign
servers the current implementation in postgres_fdw doesn't make sure that
either all of them rollback or all of them commit their changes, IOW there
is a possibility that some of them commit their changes while others
rollback theirs.

PFA patch which uses 2PC to solve this problem. In pgfdw_xact_callback() at
XACT_EVENT_PRE_COMMIT event, it sends prepares the transaction at all the
foreign postgresql servers and at XACT_EVENT_COMMIT or XACT_EVENT_ABORT
event it commits or aborts those transactions resp.

TBH, I think this is a pretty awful idea.

In the first place, this does little to improve the actual reliability
of a commit occurring across multiple foreign servers; and in the second
place it creates a bunch of brand new failure modes, many of which would
require manual DBA cleanup.

The core of the problem is that this doesn't have anything to do with
2PC as it's commonly understood: for that, you need a genuine external
transaction manager that is aware of all the servers involved in a
transaction, and has its own persistent state (or at least a way to
reconstruct its own state by examining the per-server states).
This patch is not that; in particular it treats the local transaction
asymmetrically from the remote ones, which doesn't seem like a great
idea --- ie, the local transaction could still abort after committing
all the remote ones, leaving you no better off in terms of cross-server
consistency.

As far as failure modes go, one basic reason why this cannot work as
presented is that the remote servers may not even have prepared
transaction support enabled (in fact max_prepared_transactions = 0
is the default in all supported PG versions). So this would absolutely
have to be a not-on-by-default option.

Agreed. We can have a per-foreign-server option, which says whether the
corresponding server can participate in 2PC. A transaction spanning
multiple foreign servers with at least one of them not capable of
participating in 2PC will need to be aborted.
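
A rough sketch of that check, in Python for brevity. The option name "two_phase_commit" and the function name are made up for illustration; nothing like this exists in postgres_fdw today, and the option defaults to off here because prepared transactions are disabled by default on the remote side.

```python
def check_two_phase_capable(servers):
    """Raise if a multi-server transaction includes a server not flagged for 2PC.

    servers -- mapping of foreign server name to its options dict.
    The option name "two_phase_commit" is hypothetical.
    """
    if len(servers) < 2:
        return  # a single foreign server is handled without 2PC
    incapable = sorted(
        name for name, options in servers.items()
        if options.get("two_phase_commit", "false") != "true"
    )
    if incapable:
        # Abort before any PREPARE is sent, so nothing is left dangling.
        raise RuntimeError(
            "cannot commit: servers without 2PC support: " + ", ".join(incapable)
        )
```

Running the check before the first PREPARE TRANSACTION is sent means a misconfigured server aborts the transaction while every participant can still roll back normally.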

But the bigger issue is that
leaving it to the DBA to clean up after failures is not a production
grade solution, *especially* not for prepared transactions, which are
performance killers if not closed out promptly. So I can't imagine
anyone wanting to turn this on without a more robust answer than that.

I purposefully left that outside this patch, since it involves significant
changes in core. If that's necessary for the first cut, I will work on it.

Basically I think what you'd need for this to be a credible patch would be
for it to work by changing the behavior only in the PREPARE TRANSACTION
path: rather than punting as we do now, prepare the remote transactions,
and report their server identities and gids to an external transaction
manager, which would then be responsible for issuing the actual commits
(along with the actual commit of the local transaction). I have no idea
whether it's feasible to do that without having to assume a particular
2PC transaction manager API/implementation.

I doubt a TM would expect a bunch of GIDs in response to a PREPARE
TRANSACTION command. Per X/Open, xa_prepare() returns an integer value
specifying whether the PREPARE succeeded or not, along with some
piggybacked statuses.

In the context of foreign tables under an inheritance tree, a single DML
statement can span multiple foreign servers. All such DMLs would then need
to be handled by an external TM. An external TM or application may not know
exactly which foreign servers a DML is going to affect. Users may not want
to set up an external TM in such cases. Instead they would expect
PostgreSQL to manage such DMLs and transactions all by itself.

As Robert has suggested in his responses, it would be better to enable
PostgreSQL to manage distributed transactions itself.

It'd be interesting to hear from people who are using 2PC in production
to find out if this would solve any real-world problems for them, and
what the details of the TM interface would need to look like to make it
work in practice.

In short, you can't force 2PC technology on people who aren't using it
already; while for those who are using it already, this isn't nearly
good enough as-is.

regards, tom lane

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#10

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 11 years ago

In reply to: Robert Haas (#7)

Re: Transactions involving multiple postgres foreign servers

On Tue, Jan 6, 2015 at 11:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 5, 2015 at 3:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Well, we intentionally didn't couple the FDW stuff closely into
transaction commit, because of the thought that the "far end" would not
necessarily have Postgres-like transactional behavior, and even if it did
there would be about zero chance of having atomic commit with a
non-Postgres remote server. postgres_fdw is a seriously bad starting
point as far as that goes, because it encourages one to make assumptions
that can't possibly work for any other wrapper.

Atomic commit is something that can potentially be supported by many
different FDWs, as long as the thing on the other end supports 2PC.
If you're talking to Oracle or DB2 or SQL Server, and it supports 2PC,
then you can PREPARE the transaction and then go back and COMMIT the
transaction once it's committed locally. Getting a cluster-wide
*snapshot* is probably a PostgreSQL-only thing requiring much deeper
integration, but I think it would be sensible to leave that as a
future project and solve the simpler problem first.

I think the idea I sketched upthread of supporting an external transaction
manager might be worth pursuing, in that it would potentially lead to
having at least an approximation of atomic commit across heterogeneous
servers.

An important threshold question here is whether we want to rely on an
external transaction manager, or build one into PostgreSQL. As far as
this particular project goes, there's nothing that can't be done
inside PostgreSQL. You need a durable registry of which transactions
you prepared on which servers, and which XIDs they correlate to. If
you have that, then you can use background workers or similar to go
retry commits or rollbacks of prepared transactions until it works,
even if there's been a local crash meanwhile.

Alternatively, you could rely on an external transaction manager to do
all that stuff. I don't have a clear sense of what that would entail,
or how it might be better or worse than rolling our own. I suspect,
though, that it might amount to little more than adding a middle man.
I mean, a third-party transaction manager isn't going to automatically
know how to commit a transaction prepared on some foreign server using
some foreign data wrapper. It's going to have to be taught that if
postgres_fdw leaves a transaction in-medias-res on server OID 1234,
you've got to connect to the target machine using that foreign
server's connection parameters, speak libpq, and issue the appropriate
COMMIT PREPARED command. And similarly, you're going to need to
arrange to notify it before preparing that transaction so that it
knows that it needs to request the COMMIT or ABORT later on. Once
you've got all of that infrastructure for that in place, what are you
really gaining over just doing it in PostgreSQL (or, say, a contrib
module thereto)?

Thanks, Robert, for giving a high-level view of the system needed for
PostgreSQL to be a transaction manager by itself. Agreed completely.

(I'm also concerned that an external transaction manager might need
the PostgreSQL client to be aware of it, whereas what we'd really like
here is for the client to just speak PostgreSQL and be happy that its
commits no longer end up half-done.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#11

Kevin Grittner

kgrittn@ymail.com

about 11 years ago

In reply to: Ashutosh Bapat (#8)

Re: Transactions involving multiple postgres foreign servers

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I don't see why my patch would cause inconsistencies. It can
cause dangling PREPAREd transactions and I have already
acknowledged that fact.

Am I missing something?

To me that is the big problem. Where I have run into ad hoc
distributed transaction managers it has usually been because a
crash left prepared transactions dangling, without cleaning them up
when the transaction manager was restarted. This tends to wreak
havoc one way or another.

If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems. After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.
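
To make the ordering concrete, here is a small sketch of the persisted record and the start-up decision, in Python for brevity. The JSON file, the field names, and the two state names "preparing" and "committing" are assumptions for illustration only; an in-core implementation would presumably use WAL or a state file rather than this.

```python
import json
import os


def persist(path, record):
    """Write the record so it survives a crash: temp file, fsync, rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    # Atomically replace the old record. A complete implementation would
    # also fsync the containing directory.
    os.replace(tmp, path)


def load(path):
    with open(path) as f:
        return json.load(f)


def recovery_action(record):
    """Decide what to do at start-up with a transaction left in progress."""
    if record["state"] == "preparing":
        # Persisted before the first PREPARE; not every node is known to be
        # PREPAREd, so rolling back is still allowed.
        return "ROLLBACK PREPARED"
    if record["state"] == "committing":
        # Flagged after all nodes were PREPAREd: rollback is no longer an
        # option, so every node must be committed.
        return "COMMIT PREPARED"
    raise ValueError("unknown state: %r" % record["state"])
```

The record is written with state "preparing" before any PREPARE is sent, rewritten with "committing" once all nodes are prepared, and deleted after the last commit succeeds.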

Doing this incompletely (i.e., not identifying and correctly
handling the various failure modes) is IMO far worse than not
attempting it. If we could build in something that did this
completely and well, that would be a cool selling point; but let's
not gloss over the difficulties. We must recognize how big a
problem it would be to include a low-quality implementation.

Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 11 years ago

In reply to: Kevin Grittner (#11)

Re: Transactions involving multiple postgres foreign servers

On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

I don't see why my patch would cause inconsistencies. It can
cause dangling PREPAREd transactions and I have already
acknowledged that fact.

Am I missing something?

To me that is the big problem. Where I have run into ad hoc
distributed transaction managers it has usually been because a
crash left prepared transactions dangling, without cleaning them up
when the transaction manager was restarted. This tends to wreak
havoc one way or another.

If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems.

Thanks a lot. I hadn't thought of this.

After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.

Agreed.

Doing this incompletely (i.e., not identifying and correctly
handling the various failure modes) is IMO far worse than not
attempting it. If we could build in something that did this
completely and well, that would be a cool selling point; but let's
not gloss over the difficulties. We must recognize how big a
problem it would be to include a low-quality implementation.

Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.

I described one possibility in my reply to Tom's mail. Let me repeat it
here.

We can have a per-foreign-server option, which says whether the
corresponding server is able to participate in 2PC. A transaction spanning
multiple foreign servers with at least one of them not capable of
participating in 2PC will be aborted.

Will that work?

If a user incorrectly flags a foreign server as capable of 2PC, I
expect the corresponding FDW would raise an error (either because PREPARE
fails or the FDW doesn't handle that case) and the transaction would be
aborted anyway.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#13

Kevin Grittner

kgrittn@ymail.com

about 11 years ago

In reply to: Ashutosh Bapat (#12)

Re: Transactions involving multiple postgres foreign servers

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.

We can have a per-foreign-server option, which says whether the
corresponding server is able to participate in 2PC. A transaction
spanning multiple foreign servers with at least one of them not
capable of participating in 2PC will be aborted.

Will that work?

If a user incorrectly flags a foreign server as capable of 2PC,
I expect the corresponding FDW would raise an error
(either because PREPARE fails or the FDW doesn't handle that case)
and the transaction would be aborted anyway.

That sounds like one way to handle it. I'm not clear on how you
plan to determine whether 2PC is required for a transaction.
(Apologies if it was previously mentioned and I've forgotten it.)

I don't mean to suggest that these problems are insurmountable; I
just think that people often underestimate the difficulty of
writing a distributed transaction manager and don't always
recognize the problems that it will cause if all of the failure
modes are not considered and handled.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14

Robert Haas

robertmhaas@gmail.com

about 11 years ago

In reply to: Kevin Grittner (#11)

Re: Transactions involving multiple postgres foreign servers

On Wed, Jan 7, 2015 at 11:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems. After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.

I think one key question here is whether all of this should be handled
in PostgreSQL core or whether some of it should be handled in other
ways. Is the goal to make postgres_fdw (and FDWs for other databases
that support 2PC) persist enough information that someone *could*
write a transaction manager for PostgreSQL, or is the goal to actually
write that transaction manager?

Just figuring out how to persist the necessary information is a
non-trivial problem by itself. You might think that you could just
insert a row into a local table saying, hey, I'm about to prepare a
transaction remotely, but of course that doesn't work: if you then go
on to PREPARE before writing and flushing the local commit record,
then a crash before that's done leaves a dangling prepared transaction
on the remote node. You might think to write the record, then after
writing and flushing the local commit record do the PREPARE. But you
can't do that either, because now if the PREPARE fails you've already
committed locally.

I guess what you need to do is something like:

1. Write and flush a WAL record indicating an intent to prepare, with
a list of foreign server OIDs and GUIDs.
2. Prepare the remote transaction on each node. If any of those
operations fail, roll back any prepared nodes and error out.
3. Commit locally (i.e. RecordTransactionCommit, writing and flushing WAL).
4. Try to commit the remote transactions.
5. Write a WAL record indicating that you committed the remote transactions OK.

If you fail after step 1, you can straighten things out by looking at
the status of the transaction: if the transaction committed, any
transactions we intended-to-prepare need to be checked. If they are
still prepared, we need to commit them or roll them back according to
what happened to our XID.
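
The five steps and the recovery rule can be sketched as a small simulation. This is Python rather than backend code, and every name in it is an assumption made for illustration: the Remote class, commit_distributed, recover, the "fdw_<xid>_<server>" GID format, the list standing in for WAL, and the dict standing in for local commit status.

```python
class Crash(Exception):
    """Simulated crash of the local server."""


class Remote:
    """Hypothetical stand-in for a foreign server that supports 2PC."""

    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.prepared = set()   # GIDs currently in PREPAREd state
        self.committed = set()  # GIDs that have been committed

    def prepare(self, gid):
        if self.fail_prepare:
            raise RuntimeError("PREPARE failed on " + self.name)
        self.prepared.add(gid)

    def commit_prepared(self, gid):
        self.prepared.remove(gid)
        self.committed.add(gid)

    def rollback_prepared(self, gid):
        self.prepared.remove(gid)


def commit_distributed(xid, remotes, wal, clog, crash_after_step=None):
    def maybe_crash(step):
        if crash_after_step == step:
            raise Crash(step)

    gids = {r.name: "fdw_%d_%s" % (xid, r.name) for r in remotes}
    wal.append(("intent", xid, gids))        # step 1: log intent to prepare
    maybe_crash(1)
    done = []
    try:                                     # step 2: prepare every remote
        for r in remotes:
            r.prepare(gids[r.name])
            done.append(r)
    except RuntimeError:
        for r in done:                       # undo the ones that succeeded
            r.rollback_prepared(gids[r.name])
        clog[xid] = "aborted"
        raise
    maybe_crash(2)
    clog[xid] = "committed"                  # step 3: local commit
    maybe_crash(3)
    for r in remotes:                        # step 4: commit the remotes
        r.commit_prepared(gids[r.name])
    maybe_crash(4)
    wal.append(("remote_done", xid))         # step 5: log completion


def recover(remotes, wal, clog):
    """Resolve every intent record that has no matching completion record."""
    by_name = {r.name: r for r in remotes}
    finished = {rec[1] for rec in wal if rec[0] == "remote_done"}
    for rec in list(wal):
        if rec[0] != "intent" or rec[1] in finished:
            continue
        _, xid, gids = rec
        committed = clog.get(xid) == "committed"  # no entry means aborted
        for name, gid in gids.items():
            remote = by_name[name]
            if gid not in remote.prepared:
                continue  # never prepared, or already resolved
            if committed:
                remote.commit_prepared(gid)
            else:
                remote.rollback_prepared(gid)
        wal.append(("remote_done", xid))
```

A crash after step 2 leaves no local commit, so recovery rolls the remote transactions back; a crash after step 3 leaves a local commit, so recovery commits them.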

(Andres is talking in my other ear suggesting that we ought to reuse
the 2PC infrastructure to do all this. I'm not convinced that's a
good idea, but I'll let him present his own ideas here if he wants to
rather than trying to explain them myself.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15

Kevin Grittner

kgrittn@ymail.com

about 11 years ago

In reply to: Robert Haas (#14)

Re: Transactions involving multiple postgres foreign servers

Robert Haas <robertmhaas@gmail.com> wrote:

Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.

I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.) That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16

Robert Haas

robertmhaas@gmail.com

about 11 years ago

In reply to: Kevin Grittner (#15)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.

I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.)

We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.

That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.

I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Kevin Grittner

kgrittn@ymail.com

about 11 years ago

In reply to: Robert Haas (#16)

Re: Transactions involving multiple postgres foreign servers

Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.

The key point is that the distributed transaction data must be
flagged as needing to commit rather than roll back between the
prepare phase and the final commit. If you try to avoid the
PREPARE, flagging, COMMIT PREPARED sequence by building the
flagging of the distributed transaction metadata into the COMMIT
process, you still have the problem of what to do on crash
recovery. You really need to use 2PC to keep that clean, I think.

I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.)

We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.

What about the case where one or more nodes don't support 2PC?
Do we silently make the choice, without the client really knowing?

That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.

I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.

As long as people are aware that there is an urgent need to find
and fix all data stores to which clusters on the failed machine
were connected via FDW when there is a hard machine failure, I
guess it is OK. In essence we just document it and declare it to
be somebody else's problem. In general I would expect a
distributed transaction manager to behave well in the face of any
single-machine failure, but if there is one aspect of a
full-featured distributed transaction manager we could give up, I
guess that would be it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#18

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 11 years ago

In reply to: Kevin Grittner (#13)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 8, 2015 at 7:02 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Jan 7, 2015 at 9:50 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Also, as previously mentioned, it must behave in some reasonable
way if a database is not configured to support 2PC, especially
since 2PC is off by default in PostgreSQL.

We can have a per-foreign-server option that says whether the
corresponding server is able to participate in 2PC. A transaction
spanning multiple foreign servers, with at least one of them not
capable of participating in 2PC, will be aborted.

Will that work?

If a user incorrectly flags a foreign server as capable of 2PC,
I expect the corresponding FDW would raise an error
(either because PREPARE fails or because the FDW doesn't handle that case)
and the transaction will be aborted anyway.

That sounds like one way to handle it. I'm not clear on how you
plan to determine whether 2PC is required for a transaction.
(Apologies if it was previously mentioned and I've forgotten it.)

Any transaction involving more than one server (including the local one, I
guess) will require 2PC. A transaction may modify and access a remote
database but not the local one. In such a case, the state of the local transaction
doesn't matter once the remote transaction is committed or rolled back.

I don't mean to suggest that these problems are insurmountable; I
just think that people often underestimate the difficulty of
writing a distributed transaction manager and don't always
recognize the problems that it will cause if all of the failure
modes are not considered and handled.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#19

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 11 years ago

In reply to: Robert Haas (#14)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 8, 2015 at 8:24 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 7, 2015 at 11:20 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

If we are going to include a distributed transaction manager with
PostgreSQL, it *must* persist enough information about the
transaction ID and where it is used in a way that will survive a
subsequent crash before beginning the PREPARE on any of the
systems. After all nodes are PREPAREd it must flag that persisted
data to indicate that it is now at a point where ROLLBACK is no
longer an option. Only then can it start committing the prepared
transactions. After the last node is committed it can clear this
information. On start-up the distributed transaction manager must
check for any distributed transactions left "in progress" and
commit or rollback based on the preceding; doing retries
indefinitely until it succeeds or is told to stop.

I think one key question here is whether all of this should be handled
in PostgreSQL core or whether some of it should be handled in other
ways. Is the goal to make postgres_fdw (and FDWs for other databases
that support 2PC) to persist enough information that someone *could*
write a transaction manager for PostgreSQL, or is the goal to actually
write that transaction manager?

Just figuring out how to persist the necessary information is a
non-trivial problem by itself. You might think that you could just
insert a row into a local table saying, hey, I'm about to prepare a
transaction remotely, but of course that doesn't work: if you then go
on to PREPARE before writing and flushing the local commit record,
then a crash before that's done leaves a dangling prepared transaction
on the remote node. You might think to write the record, then after
writing and flushing the local commit record do the PREPARE. But you
can't do that either, because now if the PREPARE fails you've already
committed locally.

I guess what you need to do is something like:

1. Write and flush a WAL record indicating an intent to prepare, with
a list of foreign server OIDs and GUIDs.
2. Prepare the remote transaction on each node. If any of those
operations fail, roll back any prepared nodes and error out.
3. Commit locally (i.e. RecordTransactionCommit, writing and flushing WAL).
4. Try to commit the remote transactions.
5. Write a WAL record indicating that you committed the remote
transactions OK.

If you fail after step 1, you can straighten things out by looking at
the status of the transaction: if the transaction committed, any
transactions we intended-to-prepare need to be checked. If they are
still prepared, we need to commit them or roll them back according to
what happened to our XID.
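
The five steps and the recovery rule above can be sketched as a small simulation. This is illustrative Python, not PostgreSQL code: the `wal` list merely stands in for flushed WAL records, and `FakeNode`, `commit_distributed`, `recover`, and the `fx_<xid>_<node>` identifier format are hypothetical names introduced here.

```python
class FakeNode:
    """Stand-in for a foreign server that supports prepared transactions."""

    def __init__(self, name, fail_prepare=False):
        self.name = name
        self.fail_prepare = fail_prepare
        self.prepared = set()   # GIDs currently in prepared state
        self.committed = set()  # GIDs that have been committed

    def prepare(self, gid):
        if self.fail_prepare:
            raise RuntimeError(f"PREPARE failed on {self.name}")
        self.prepared.add(gid)

    def commit_prepared(self, gid):
        self.prepared.discard(gid)
        self.committed.add(gid)

    def rollback_prepared(self, gid):
        self.prepared.discard(gid)


def commit_distributed(wal, xid, nodes):
    """Run steps 1-5; `wal` is an append-only list standing in for flushed WAL."""
    gids = {n.name: f"fx_{xid}_{n.name}" for n in nodes}
    wal.append(("intent_to_prepare", xid, dict(gids)))   # step 1
    done = []
    try:
        for n in nodes:                                  # step 2
            n.prepare(gids[n.name])
            done.append(n)
    except RuntimeError:
        for n in done:                                   # undo what was prepared
            n.rollback_prepared(gids[n.name])
        wal.append(("abort", xid))
        raise
    wal.append(("commit", xid))                          # step 3
    for n in nodes:                                      # step 4
        n.commit_prepared(gids[n.name])
    wal.append(("remote_commits_done", xid))             # step 5


def recover(wal, nodes_by_name):
    """After a crash, resolve still-prepared remotes from the local outcome."""
    committed = {rec[1] for rec in wal if rec[0] == "commit"}
    finished = {rec[1] for rec in wal if rec[0] == "remote_commits_done"}
    for rec in wal:
        if rec[0] != "intent_to_prepare" or rec[1] in finished:
            continue
        xid, gids = rec[1], rec[2]
        for name, gid in gids.items():
            node = nodes_by_name[name]
            if gid not in node.prepared:
                continue  # never prepared, or already resolved
            if xid in committed:
                node.commit_prepared(gid)
            else:
                node.rollback_prepared(gid)
```

The sketch assumes every node is reachable during recovery; the reply that follows is about what happens when that assumption does not hold.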

When you want to straighten things out, the foreign server may not
be available. As Kevin pointed out above, we need to keep
retrying to resolve (commit or roll back, based on the status of the local
transaction) the PREPAREd transactions on the foreign server until they are
resolved. So we will have to either persist the information somewhere other
than the WAL, or retain the WAL even after the corresponding local
transaction has been committed or aborted. I don't think the latter is a good
idea, since it will affect replication and VACUUM, especially because it
affects the oldest transaction in WAL.

That's where Andres's suggestion might help.

(Andres is talking in my other ear suggesting that we ought to reuse
the 2PC infrastructure to do all this. I'm not convinced that's a
good idea, but I'll let him present his own ideas here if he wants to
rather than trying to explain them myself.)

We can persist the information about distributed transactions (especially
those that require 2PC) in a way similar to what the 2PC infrastructure does in
the pg_twophase directory. I am still investigating whether we can reuse the
existing 2PC infrastructure or not. My initial reaction is no, since 2PC
persists information about the local transaction, including locked objects and
WAL (?), in the pg_twophase directory, which is not required for a distributed
transaction. But the rest of the mechanism, such as the way records are
processed during normal processing and recovery, looks very useful.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#20

Jim Nasby

Jim.Nasby@BlueTreble.com

about 11 years ago

In reply to: Kevin Grittner (#17)

Re: Transactions involving multiple postgres foreign servers

On 1/8/15, 12:00 PM, Kevin Grittner wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 8, 2015 at 10:19 AM, Kevin Grittner <kgrittn@ymail.com> wrote:

Robert Haas <robertmhaas@gmail.com> wrote:

Andres is talking in my other ear suggesting that we ought to
reuse the 2PC infrastructure to do all this.

If you mean that the primary transaction and all FDWs in the
transaction must use 2PC, that is what I was saying, although
apparently not clearly enough. All nodes *including the local one*
must be prepared and committed with data about the nodes saved
safely off somewhere that it can be read in the event of a failure
of any of the nodes *including the local one*. Without that, I see
this whole approach as a train wreck just waiting to happen.

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.

The key point is that the distributed transaction data must be
flagged as needing to commit rather than roll back between the
prepare phase and the final commit. If you try to avoid the
PREPARE, flagging, COMMIT PREPARED sequence by building the
flagging of the distributed transaction metadata into the COMMIT
process, you still have the problem of what to do on crash
recovery. You really need to use 2PC to keep that clean, I think.

If we had an independent transaction coordinator then I agree with you Kevin. I think Robert is proposing that if we are controlling one of the nodes that's participating as well as coordinating the overall transaction that we can take some shortcuts. AIUI a PREPARE means you are completely ready to commit. In essence you're just waiting to write and fsync the commit message. That is in fact the state that a coordinating PG node would be in by the time everyone else has done their prepare. So from that standpoint we're OK.

Now, as soon as ANY of the nodes commit, our coordinating node MUST be able to commit as well! That would require it to have a real prepared transaction of its own. However, as long as there is zero chance of any other prepared transactions committing before our local transaction, that step isn't actually needed. Our local transaction will either commit or abort, and that will determine what needs to happen on all other nodes.

I'm ignoring the question of how the local node needs to store info about the other nodes in case of a crash, but AFAICT you could reliably recover manually from what I just described.

I think the question is: are we OK with "going under the skirt" in this fashion? Presumably it would provide better performance, whereas forcing ourselves to eat our own 2PC dogfood would presumably make it easier for someone to plug in an external coordinator instead of using our own. I think there's also a lot to be said for getting a partial implementation of this available today (requiring manual recovery), so long as it's not in core.

BTW, I found https://www.cs.rutgers.edu/~pxk/417/notes/content/transactions.html a useful read, specifically the 2PC portion.

I'm not really clear on the mechanism that is being proposed for
doing this, but one way would be to have the PREPARE of the local
transaction be requested explicitly and to have that cause all FDWs
participating in the transaction to also be prepared. (That might
be what Andres meant; I don't know.)

We want this to be client-transparent, so that the client just says
COMMIT and everything Just Works.

What about the case where one or more nodes don't support 2PC?
Do we silently make the choice, without the client really knowing?

We abort. (Unless we want to have a running_with_scissors GUC...)

That doesn't strike me as the
only possible mechanism to drive this, but it might well be the
simplest and cleanest. The trickiest bit might be to find a good
way to persist the distributed transaction information in a way
that survives the failure of the main transaction -- or even the
abrupt loss of the machine it's running on.

I'd be willing to punt on surviving a loss of the entire machine. But
I'd like to be able to survive an abrupt reboot.

As long as people are aware that there is an urgent need to find
and fix all data stores to which clusters on the failed machine
were connected via FDW when there is a hard machine failure, I
guess it is OK. In essence we just document it and declare it to
be somebody else's problem. In general I would expect a
distributed transaction manager to behave well in the face of any
single-machine failure, but if there is one aspect of a
full-featured distributed transaction manager we could give up, I
guess that would be it.

ISTM that one option here would be to "simply" write and sync WAL record(s) of all externally prepared transactions. That would be enough for a hot standby to find all the other servers and tell them to either commit or abort, based on whether our local transaction committed or aborted. If you wanted, you could even have the standby be responsible for telling all the other participants to commit...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#21

Michael Paquier

michael.paquier@gmail.com

about 11 years ago

In reply to: Jim Nasby (#20)

Re: Transactions involving multiple postgres foreign servers

On Sat, Jan 10, 2015 at 9:02 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 1/8/15, 12:00 PM, Kevin Grittner wrote:

The key point is that the distributed transaction data must be
flagged as needing to commit rather than roll back between the
prepare phase and the final commit. If you try to avoid the
PREPARE, flagging, COMMIT PREPARED sequence by building the
flagging of the distributed transaction metadata into the COMMIT
process, you still have the problem of what to do on crash
recovery. You really need to use 2PC to keep that clean, I think.

Yes, 2PC is needed as long as more than one node performs write
operations within a transaction.

If we had an independent transaction coordinator then I agree with you
Kevin. I think Robert is proposing that if we are controlling one of the
nodes that's participating as well as coordinating the overall transaction
that we can take some shortcuts. AIUI a PREPARE means you are completely
ready to commit. In essence you're just waiting to write and fsync the
commit message. That is in fact the state that a coordinating PG node would
be in by the time everyone else has done their prepare. So from that
standpoint we're OK.

Now, as soon as ANY of the nodes commit, our coordinating node MUST be able
to commit as well! That would require it to have a real prepared transaction
of its own. However, as long as there is zero chance of any other
prepared transactions committing before our local transaction, that step
isn't actually needed. Our local transaction will either commit or abort,
and that will determine what needs to happen on all other nodes.

It is a property of 2PC to ensure that a prepared transaction will
commit. Now, once it is confirmed on the coordinator that all the
remote nodes have successfully PREPAREd, the coordinator issues COMMIT
PREPARED to each node. What do you do if some nodes report ABORT
PREPARED while other nodes report COMMIT PREPARED? Do you abort the
transaction on the coordinator, commit it, or FATAL? This leaves the cluster
in an inconsistent state, meaning that some consistent cluster-wide
recovery point is needed as well (Postgres-XC and XL have introduced
the concept of barriers for such problems, stuff created first by
Pavan Deolasee).
--
Michael


#22

Jim Nasby

Jim.Nasby@BlueTreble.com

about 11 years ago

In reply to: Michael Paquier (#21)

Re: Transactions involving multiple postgres foreign servers

On 1/10/15, 7:11 AM, Michael Paquier wrote:

If we had an independent transaction coordinator then I agree with you
Kevin. I think Robert is proposing that if we are controlling one of the
nodes that's participating as well as coordinating the overall transaction
that we can take some shortcuts. AIUI a PREPARE means you are completely
ready to commit. In essence you're just waiting to write and fsync the
commit message. That is in fact the state that a coordinating PG node would
be in by the time everyone else has done their prepare. So from that
standpoint we're OK.

Now, as soon as ANY of the nodes commit, our coordinating node MUST be able
to commit as well! That would require it to have a real prepared transaction
of its own. However, as long as there is zero chance of any other
prepared transactions committing before our local transaction, that step
isn't actually needed. Our local transaction will either commit or abort,
and that will determine what needs to happen on all other nodes.

It is a property of 2PC to ensure that a prepared transaction will
commit. Now, once it is confirmed on the coordinator that all the
remote nodes have successfully PREPAREd, the coordinator issues COMMIT
PREPARED to each node. What do you do if some nodes report ABORT
PREPARED while other nodes report COMMIT PREPARED? Do you abort the
transaction on the coordinator, commit it, or FATAL? This leaves the cluster
in an inconsistent state, meaning that some consistent cluster-wide
recovery point is needed as well (Postgres-XC and XL have introduced
the concept of barriers for such problems, stuff created first by
Pavan Deolasee).

My understanding is that once you get a successful PREPARE that should mean that it's basically impossible for the transaction to fail to commit. If that's not the case, I fail to see how you can get any decent level of sanity out of this...
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


#23

Michael Paquier

michael.paquier@gmail.com

about 11 years ago

In reply to: Jim Nasby (#22)

Re: Transactions involving multiple postgres foreign servers

On Sun, Jan 11, 2015 at 10:37 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

On 1/10/15, 7:11 AM, Michael Paquier wrote:

If we had an independent transaction coordinator then I agree with you
Kevin. I think Robert is proposing that if we are controlling one of the
nodes that's participating as well as coordinating the overall transaction
that we can take some shortcuts. AIUI a PREPARE means you are completely
ready to commit. In essence you're just waiting to write and fsync the
commit message. That is in fact the state that a coordinating PG node would
be in by the time everyone else has done their prepare. So from that
standpoint we're OK.

Now, as soon as ANY of the nodes commit, our coordinating node MUST be able
to commit as well! That would require it to have a real prepared transaction
of its own. However, as long as there is zero chance of any other
prepared transactions committing before our local transaction, that step
isn't actually needed. Our local transaction will either commit or abort,
and that will determine what needs to happen on all other nodes.

It is a property of 2PC to ensure that a prepared transaction will
commit. Now, once it is confirmed on the coordinator that all the
remote nodes have successfully PREPAREd, the coordinator issues COMMIT
PREPARED to each node. What do you do if some nodes report ABORT
PREPARED while other nodes report COMMIT PREPARED? Do you abort the
transaction on the coordinator, commit it, or FATAL? This leaves the cluster
in an inconsistent state, meaning that some consistent cluster-wide
recovery point is needed as well (Postgres-XC and XL have introduced
the concept of barriers for such problems, stuff created first by
Pavan Deolasee).

My understanding is that once you get a successful PREPARE that should mean
that it's basically impossible for the transaction to fail to commit. If
that's not the case, I fail to see how you can get any decent level of
sanity out of this...

When giving the responsibility for a group of COMMIT PREPARED operations to a
set of nodes in a network, a couple of problems could show up,
split-brain for example. There could also be failures
at the hardware level, so you would need a mechanism ensuring that WAL is
consistent among all the nodes, for example by adding a
common restore point on all the nodes once PREPARE is successfully
done, using XLOG_RESTORE_POINT. That's one reason why I think
that the local coordinator should use 2PC as well, to ensure a
consistency point once all the remote nodes have successfully
PREPAREd, and a reason why things can get complicated for either the
DBA or the upper application in charge of ensuring DB consistency
even in case of critical failures.
--
Michael


#24

Robert Haas

robertmhaas@gmail.com

almost 11 years ago

In reply to: Kevin Grittner (#17)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 8, 2015 at 1:00 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

Clearly, all the nodes other than the local one need to use 2PC. I am
unconvinced that the local node must write a 2PC state file only to
turn around and remove it again almost immediately thereafter.

The key point is that the distributed transaction data must be
flagged as needing to commit rather than roll back between the
prepare phase and the final commit. If you try to avoid the
PREPARE, flagging, COMMIT PREPARED sequence by building the
flagging of the distributed transaction metadata into the COMMIT
process, you still have the problem of what to do on crash
recovery. You really need to use 2PC to keep that clean, I think.

I don't really follow this. You need to write a list of the
transactions that you're going to prepare to stable storage before
preparing any of them. And then you need to write something to stable
storage when you've definitively determined that you're going to
commit. But we have no current mechanism for the first thing (so
reusing 2PC doesn't help) and we already have the second thing (it's
the commit record itself).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#25

Robert Haas

robertmhaas@gmail.com

almost 11 years ago

In reply to: Michael Paquier (#23)

Re: Transactions involving multiple postgres foreign servers

On Sun, Jan 11, 2015 at 3:36 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

My understanding is that once you get a successful PREPARE that should mean
that it's basically impossible for the transaction to fail to commit. If
that's not the case, I fail to see how you can get any decent level of
sanity out of this...

When giving the responsibility for a group of COMMIT PREPARED operations to a
set of nodes in a network, a couple of problems could show up,
split-brain for example.

I think this is just confusing the issue. When a machine reports that
a transaction is successfully prepared, any future COMMIT PREPARED
operation *must* succeed. If it doesn't, the machine has broken its
promises, and that's not OK. Period. It doesn't matter whether
that's due to split-brain or sunspots or Oscar Wilde having bad
breath. If you say that it's prepared, then you're not allowed to
change your mind later and say that it can't be committed. If you do,
then you have a broken 2PC implementation and, as Jim says, all bets
are off.

Now of course nothing is certain in life except death and taxes. If
you PREPARE a transaction, and then go into the data directory and
corrupt the 2PC state file using dd, and then try to commit it, it
might fail. But no system can survive that sort of thing, whether 2PC
is involved or not; in such extraordinary situations, of course
operator intervention will be required. But in a more normal
situation where you just have a failover, if the failover causes your
prepared transaction to come unprepared, that means your failover
mechanism is broken. If you're using synchronous replication, this
shouldn't happen.

There could be as well failures
at hardware-level, so you would need a mechanism ensuring that WAL is
consistent among all the nodes, with for example the addition of a
common restore point on all the nodes once PREPARE is successfully
done with for example XLOG_RESTORE_POINT. That's a reason why I think
that the local Coordinator should use 2PC as well, to ensure a
consistency point once all the remote nodes have successfully
PREPAREd, and a reason why things can get complicated for either the
DBA or the upper application in charge of ensuring the DB consistency
even in case of critical failures.

It's up to the DBA to decide whether they care about surviving
complete loss of a node while having 2PC still work. If they do, they
should use sync rep, and they should be fine -- the machine on which
the transaction is prepared shouldn't acknowledge the PREPARE as
having succeeded until the WAL is safely on disk on the standby. Most
probably don't, though; that's a big performance penalty.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#26

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 11 years ago

In reply to: Robert Haas (#25)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

Hi All,

Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised earlier in this mail thread. Let me know if I have missed something.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to make it FDW-independent.

A. Steps during transaction processing
------------------------------------------------

1. When an FDW connects to a foreign server and starts a transaction, it
registers that server with a boolean flag indicating whether that server is
capable of participating in a two-phase commit. In the patch this is
implemented by the function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if there is at least one foreign
server incapable of 2PC in a multi-server transaction. This error is thrown as
early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just updates the information. As of now,
the function in the patch is only a stub.

Whether a foreign server is capable of 2PC can be
a. an FDW-level decision, e.g. file_fdw is incapable of 2PC as of now, but it
can build the capability, which could then be used for all servers using
file_fdw
b. a decision based on server version, type, etc., so the FDW can decide by
looking at the properties of each server
c. a user decision, where the FDW allows a user to specify it in the form
of a CREATE/ALTER SERVER option. This is implemented in the patch.
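
The registration check described in step 1 could behave roughly as follows. This is a hedged Python sketch of the rule only, since RegisterXactForeignServer() is a stub in the patch; `register_foreign_server`, `ForeignXactError`, and the dict-based bookkeeping are hypothetical names introduced for illustration.

```python
class ForeignXactError(Exception):
    """Raised to abort the local transaction."""


def register_foreign_server(registered, server_id, two_phase_capable):
    """Record a foreign server touched by the current transaction.

    `registered` maps server_id -> two_phase_capable for this transaction.
    Raises as soon as the transaction spans more than one server and at
    least one of them cannot take part in two-phase commit.
    """
    candidate = dict(registered)
    candidate[server_id] = two_phase_capable
    if len(candidate) > 1 and not all(candidate.values()):
        raise ForeignXactError(
            "transaction spans multiple foreign servers and at least one "
            "cannot participate in two-phase commit"
        )
    registered.update(candidate)
    return registered
```

A single server that is incapable of 2PC is accepted, which matches the statement that single-server transactions keep their current behaviour.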

For a transaction involving only a single foreign server, the current code
remains unaltered, as two-phase commit is not needed. The rest of the discussion
pertains to transactions involving more than one foreign server.
At commit or abort time, the FDW receives callbacks with the
appropriate events. The FDW then takes the following actions on each event.

2. On the XACT_EVENT_PRE_COMMIT event, the FDW coins one prepared transaction
id per foreign server involved and saves it in memory, along with the xid,
dbid, foreign server id, user mapping, and foreign transaction status =
PREPARING. The prepared transaction id can be anything that can be represented
as a byte string. The same information is flushed to disk to survive crashes.
This is implemented in the patch as prepare_foreign_xact(). Persistent and
in-memory storage and their usage are discussed later in this mail. The FDW
then prepares the transaction on the foreign server. If this step is
successful, the foreign transaction status is changed to PREPARED. If the
step is unsuccessful, the local transaction is aborted and each FDW will
receive XACT_EVENT_ABORT (discussed later). The updates to the foreign
transaction status need not be flushed to disk, as they can be inferred
from the status of the local transaction.

3. If the local transaction is committed, the FDW callback will get the
XACT_EVENT_COMMIT event. The foreign transaction status is changed to
COMMITTING. The FDW tries to commit the foreign transaction with the prepared
transaction id. If the commit is successful, the foreign transaction entry
is removed. If the commit is unsuccessful because of a local/foreign server
crash or network failure, the foreign prepared transaction resolver takes
care of committing it at a later point in time.

4. If the local transaction is aborted, the FDW callback will get the
XACT_EVENT_ABORT event. At this point, the FDW may or may not have prepared
a transaction on the foreign server as per step 2 above. If it has not prepared
the transaction, it simply aborts the transaction on the foreign server; a
server crash or network failure doesn't alter the ultimate result in this
case. If the FDW has prepared the foreign transaction, it updates the foreign
transaction status to ABORTING and tries to roll back the prepared
transaction. If the rollback is successful, the foreign transaction entry
is removed. If the rollback is not successful, the foreign prepared
transaction resolver takes care of aborting it at a later point in time.
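
The status transitions in steps 2 to 4 can be summarised with a small model. This Python sketch is not the patch's code: `Status`, `ForeignXactEntry`, the `on_*` handlers, and the `px_<xid>_<server>` id format are hypothetical, and the remote operations are passed in as plain callables.

```python
from enum import Enum


class Status(Enum):
    PREPARING = "preparing"
    PREPARED = "prepared"
    COMMITTING = "committing"
    ABORTING = "aborting"


class ForeignXactEntry:
    def __init__(self, xid, server_id, gid):
        self.xid = xid
        self.server_id = server_id
        self.gid = gid
        self.status = Status.PREPARING


def on_pre_commit(entries, xid, server_id, remote_prepare):
    """Step 2: record the entry first, then prepare on the foreign server."""
    gid = f"px_{xid}_{server_id}"   # hypothetical id format
    entry = ForeignXactEntry(xid, server_id, gid)
    entries[gid] = entry            # the real entry would also be flushed to disk
    remote_prepare(gid)             # an exception here aborts the local transaction
    entry.status = Status.PREPARED
    return entry


def on_commit(entries, entry, remote_commit_prepared):
    """Step 3: commit the prepared transaction, or leave it for the resolver."""
    entry.status = Status.COMMITTING
    try:
        remote_commit_prepared(entry.gid)
    except ConnectionError:
        return False                # entry stays; the resolver retries later
    del entries[entry.gid]
    return True


def on_abort(entries, entry, remote_rollback_prepared):
    """Step 4, for a foreign transaction that was already prepared."""
    entry.status = Status.ABORTING
    try:
        remote_rollback_prepared(entry.gid)
    except ConnectionError:
        return False                # entry stays; the resolver retries later
    del entries[entry.gid]
    return True
```

An entry is removed only after the remote side confirms the outcome, so anything left in `entries` is exactly the work that remains for the resolver.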

B. Foreign prepared transaction resolver
---------------------------------------------------
In the patch this is implemented as a built-in function pg_fdw_resolve().
Ideally the functionality should be run by a background worker process
frequently.

The resolver looks at each entry and invokes the FDW routine to resolve the
transaction. The FDW routine returns a boolean status: true if the prepared
transaction was resolved (committed/aborted), false otherwise.
The resolution is as follows -
1. If the foreign transaction status is COMMITTING or ABORTING, it commits or
aborts the prepared transaction, respectively, through the FDW routine. If the
transaction is successfully resolved, it removes the foreign transaction
entry.
2. Otherwise, it checks whether the local transaction was committed or aborted,
updates the foreign transaction status accordingly, and takes the action
described in step 1 above.
3. The resolver doesn't touch entries created by in-progress local
transactions.
3. The resolver doesn't touch entries created by in-progress local
transactions.

If the server/backend crashes after it has registered the foreign
transaction entry (during step A.1), we will be left with a prepared
transaction id which was never prepared on the foreign server. Similarly, if
the server/backend crashes after it has resolved the foreign prepared
transaction but before removing the entry, the same situation can arise. The
FDW should detect these situations, when the foreign server complains about
non-existing prepared transaction ids, and consider such foreign
transactions as resolved.
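
One possible way for an FDW to recognise this case is to look at the
SQLSTATE of the failed COMMIT PREPARED or ROLLBACK PREPARED. The sketch
below assumes that the foreign PostgreSQL server reports a missing prepared
transaction id with SQLSTATE 42704 (undefined_object); that should be
verified against the target server version. The function name
sketch_entry_is_resolved is hypothetical. With libpq the SQLSTATE string
would typically come from PQresultErrorField(res, PG_DIAG_SQLSTATE).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Assumption: SQLSTATE reported by the foreign server when the prepared
 * transaction id does not exist there (undefined_object).
 */
#define SKETCH_SQLSTATE_UNDEFINED_OBJECT "42704"

/*
 * Decide whether a foreign transaction entry can be treated as resolved.
 * command_ok is true when COMMIT PREPARED / ROLLBACK PREPARED succeeded;
 * sqlstate is the five-character error code otherwise (NULL if the failure
 * produced no server-side error, e.g. a lost connection).
 */
static bool
sketch_entry_is_resolved(bool command_ok, const char *sqlstate)
{
	/* SQLSTATE codes are always five characters long. */
	assert(sqlstate == NULL || strlen(sqlstate) == 5);

	if (command_ok)
		return true;

	/* The id was never prepared, or was already resolved before a crash. */
	if (sqlstate != NULL &&
		strcmp(sqlstate, SKETCH_SQLSTATE_UNDEFINED_OBJECT) == 0)
		return true;

	/* Any other failure: keep the entry so the resolver can retry. */
	return false;
}
```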

After looking at all the entries the resolver flushes the entries to the
disk, so as to retain the latest status across shutdown and crash.

C. Other methods and infrastructure
------------------------------------------------
1. Method to show the current foreign transaction entries (in progress or
waiting to be resolved). Implemented as function pg_fdw_xact() in the patch.
2. Method to drop foreign transaction entries in case they are resolved by
user/DBA themselves. Not implemented in the patch.
3. Method to prevent altering or dropping the foreign server and user
mapping used to prepare the foreign transaction till the latter gets
resolved. Not implemented in the patch. While altering or dropping the
foreign server or user mapping, that portion of the code needs to check if
there exists a foreign transaction entry depending upon the foreign server
or user mapping and should error out.
4. The information about the xid needs to be available till it is decided
whether to commit or abort the foreign transaction and that decision is
persisted. That should put some constraint on the xid wraparound or oldest
active transaction. Not implemented in the patch.
5. Method to propagate the foreign transaction information to the slave.
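
The dependency check in C.3 is not implemented in the patch. A minimal
sketch of what such a check might look like follows, assuming each foreign
transaction entry records the server OID and user OID it was prepared with
(as the arguments to insert_fdw_xact() in the patch suggest). All names
below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int SketchOid;	/* stand-in for PostgreSQL's Oid */

/* The part of a foreign transaction entry relevant to dependency checks. */
typedef struct
{
	SketchOid	serverid;
	SketchOid	userid;
} SketchFxRef;

/* True if any unresolved entry was prepared on the given foreign server. */
static bool
sketch_server_in_use(const SketchFxRef *entries, size_t n, SketchOid serverid)
{
	size_t		i;

	assert(entries != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (entries[i].serverid == serverid)
			return true;
	return false;
}

/* True if any unresolved entry was prepared through the given user mapping. */
static bool
sketch_user_mapping_in_use(const SketchFxRef *entries, size_t n,
						   SketchOid serverid, SketchOid userid)
{
	size_t		i;

	assert(entries != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (entries[i].serverid == serverid && entries[i].userid == userid)
			return true;
	return false;
}
```

The ALTER/DROP code paths would raise an error when either function returns
true for the object being changed.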

D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered the following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to file whenever
persistence is necessary e.g. while registering the foreign transaction in
step A.2. Requirements C.1, C.2 need some SQL interface in the form of
built-in functions or SQL commands.

The patch implements the in-memory foreign transaction table as a fixed
size array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on the number of foreign
prepared transactions that can be maintained at a time. We need
separate locks to synchronize the access to the shared memory; the patch
uses only a single LW lock. There is a restriction on the length of the
prepared transaction id (or prepared transaction information saved by the
FDW, to be general), since everything is being saved in fixed size memory.
We may be able to overcome that restriction by writing this information to
separate files (one file per foreign prepared transaction). We need to take
the same route as 2PC for C.5.
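
The two restrictions mentioned above (a bounded number of entries and a
bounded id length) follow directly from the fixed-size layout. The sketch
below illustrates this with a plain array; it is not the patch's data
structure. The names and the slot count are hypothetical, the 200-byte
limit is the one quoted from twophase.c in the patch, and locking and
persistence are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define SKETCH_MAX_FDW_XACTS	4	/* hypothetical table size */
#define SKETCH_MAX_ID_LEN		200	/* id limit, as in twophase.c */

typedef struct
{
	bool		in_use;
	int			status;
	char		prep_id[SKETCH_MAX_ID_LEN];
} SketchFxSlot;

typedef struct
{
	SketchFxSlot slots[SKETCH_MAX_FDW_XACTS];
} SketchFxTable;

/*
 * Register an entry; returns its slot index, or -1 if the table is full or
 * the id (including its terminating NUL) does not fit in a slot.
 */
static int
sketch_table_insert(SketchFxTable *table, const char *prep_id, int status)
{
	size_t		len;
	int			i;

	assert(table != NULL && prep_id != NULL);
	len = strlen(prep_id);
	if (len >= SKETCH_MAX_ID_LEN)
		return -1;

	for (i = 0; i < SKETCH_MAX_FDW_XACTS; i++)
	{
		if (!table->slots[i].in_use)
		{
			table->slots[i].in_use = true;
			table->slots[i].status = status;
			memcpy(table->slots[i].prep_id, prep_id, len + 1);
			return i;
		}
	}
	return -1;
}

/* Free a slot; returns false if the index is invalid or already free. */
static bool
sketch_table_remove(SketchFxTable *table, int idx)
{
	assert(table != NULL);
	if (idx < 0 || idx >= SKETCH_MAX_FDW_XACTS || !table->slots[idx].in_use)
		return false;
	table->slots[idx].in_use = false;
	return true;
}
```

In shared memory, every call would additionally have to hold the lock that
protects the table, which is where the single LW lock mentioned above comes
in.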

2. New catalog - This method removes the need for separate methods for
C1, C5 and even C2. Also, the synchronization will be taken care of by row
locks, and there will be no limit on the number of foreign transactions or
on the size of foreign prepared transaction information. But the big problem
with this approach is that the changes to the catalogs are atomic with the
local transaction. If a foreign prepared transaction can not be aborted
while the local transaction is rolled back, that entry needs to be retained.
But since the local transaction is aborting, the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this
by simulating an autonomous transaction through a background worker (which
can also act as a resolver). But the amount of communication and
synchronization might affect the performance.

A mixed approach, where the backend shifts the entries from the storage in
approach 1 to a catalog, thus lifting the constraints on size, is possible
but very complicated.

Any other ideas to use a catalog table as the persistent storage here? Does
anybody think a catalog table is a viable option?

3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records can not be used for repeated scans, as is required by the
foreign transaction resolver. Also, replaying WAL is controlled by
checkpoints, so not all WAL records are replayed. If a checkpoint happens
while a foreign prepared transaction remains unresolved, the corresponding
WAL records will never be replayed, thus causing the foreign prepared
transaction to remain unresolved forever without a clue. So, WAL alone
doesn't seem to be a fit here.

The algorithms rely on the FDWs to take the right steps to a large extent,
rather than controlling each step explicitly. They expect the FDWs to take
the right steps for each event and call the right functions to manipulate
foreign transaction entries. They do not ensure the correctness of these
steps, by say examining the foreign transaction entries in response to each
event or by making the callback return the information and manipulating the
entries within the core. I am willing to go the stricter but more intrusive
route if the others also think that way. Otherwise, the current approach is
less intrusive and I am fine with that too.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/pg_xlogdump/rmgrdesc.c b/contrib/pg_xlogdump/rmgrdesc.c
index 180818d..2fc6d82 100644
--- a/contrib/pg_xlogdump/rmgrdesc.c
+++ b/contrib/pg_xlogdump/rmgrdesc.c
@@ -14,20 +14,21 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/fdw_xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
 	{ name, desc, identify},
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4e02cb2..3653c58 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -8,20 +8,22 @@
  * IDENTIFICATION
  *		  contrib/postgres_fdw/connection.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 
 
 /*
  * Connection cache hash table entry
  *
  * The lookup key in this hash table is the foreign server OID plus the user
@@ -42,45 +44,51 @@ typedef struct ConnCacheKey
 } ConnCacheKey;
 
 typedef struct ConnCacheEntry
 {
 	ConnCacheKey key;			/* hash key (must be first) */
 	PGconn	   *conn;			/* connection to foreign server, or NULL */
 	int			xact_depth;		/* 0 = no xact open, 1 = main xact open, 2 =
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	char		*prep_xact_name;	/* Name of prepared transaction on this connection */
+	int			prep_xact_id;
 } ConnCacheEntry;
 
 /*
  * Connection cache (initialized on first use)
  */
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool is_server_twophase_compliant(ForeignServer *server);
+static void prepare_foreign_xact(ConnCacheEntry *entry, char *prep_xact_name);
+static void resolve_foreign_xact(ConnCacheEntry *entry, int action);
+static char *pgfdw_get_prep_xact_id(ConnCacheEntry *entry);
 
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
  * server with the user's authorization.  A new connection is established
  * if we don't already have a suitable one, and a transaction is opened at
  * the right subtransaction nesting depth if we didn't do that already.
  *
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
@@ -88,21 +96,21 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
  * syscaches.  For the moment, though, it's not clear that this would really
  * be useful and not mere pedantry.  We could not flush any active connections
  * mid-transaction anyway.
  */
 PGconn *
 GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt)
+			  bool will_prep_stmt, bool start_transaction)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
 		HASHCTL		ctl;
 
@@ -116,38 +124,36 @@ GetConnection(ForeignServer *server, UserMapping *user,
 									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
 		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key.serverid = server->serverid;
 	key.userid = user->userid;
 
 	/*
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
 		/* initialize new hashtable entry (key is already filled in) */
 		entry->conn = NULL;
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->prep_xact_name = NULL;
 	}
 
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
 	 * connection is actually used.
 	 */
 
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
@@ -155,26 +161,33 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
 		entry->conn = connect_pg_server(server, user);
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
+		entry->prep_xact_name = NULL;
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, server);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
+
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
 	return entry->conn;
 }
 
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
@@ -362,29 +375,38 @@ do_sql_command(PGconn *conn, const char *sql)
  * Start remote transaction or subtransaction, if needed.
  *
  * Note that we always use at least REPEATABLE READ in the remote session.
  * This is so that, if a query initiates multiple scans of the same or
  * different foreign tables, we will get snapshot-consistent results from
  * those scans.  A disadvantage is that we can't provide sane emulation of
  * READ COMMITTED behavior --- it would be nice if we had some other way to
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether the two phase
+		 * compliance is needed. The function would throw error, if the
+		 * transaction involves multiple foreign server and the one being
+		 * registered does not support 2PC. 
+		 */
+		RegisterXactForeignServer(entry->key.serverid,
+									is_server_twophase_compliant(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
 		if (IsolationIsSerializable())
 			sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
 		else
 			sql = "START TRANSACTION ISOLATION LEVEL REPEATABLE READ";
 		do_sql_command(entry->conn, sql);
 		entry->xact_depth = 1;
 	}
@@ -506,20 +528,50 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
+ * pgfdw_get_prep_xact_id
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is part
+ * of the extension not the core.
+ */
+static char *
+pgfdw_get_prep_xact_id(ConnCacheEntry *entry)
+{
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_xact_name;
+	MemoryContext	old_context;
+
+	old_context = MemoryContextSwitchTo(CacheMemoryContext);
+	/* Allocate the memory in the same context as the hash entry */
+	prep_xact_name = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	MemoryContextSwitchTo(old_context);
+	snprintf(prep_xact_name, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								entry->key.serverid, entry->key.userid);
+	return prep_xact_name;
+}
+/*
  * pgfdw_xact_callback --- cleanup at main-transaction end.
  */
 static void
 pgfdw_xact_callback(XactEvent event, void *arg)
 {
 	HASH_SEQ_STATUS scan;
 	ConnCacheEntry *entry;
 
 	/* Quick exit if no connections were touched in this transaction. */
 	if (!xact_got_connection)
@@ -540,22 +592,38 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 
 		/* If it has an open remote transaction, try to close it */
 		if (entry->xact_depth > 0)
 		{
 			elog(DEBUG3, "closing remote transaction on connection %p",
 				 entry->conn);
 
 			switch (event)
 			{
 				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
+					/* 
+					 * If the local server requires two phase commit (because
+					 * there are more than one foreign servers involved in the
+					 * transaction), execute the first phase i.e. prepare the
+					 * transaction on the foreign server. The server will tell
+					 * us whether to commit or rollback that prepared
+					 * transaction.
+					 * Otherwise, commit the transaction. If the COMMIT fails,
+					 * the local transaction will be rolled back.
+					 */
+
+					if (FdwTwoPhaseNeeded())
+						prepare_foreign_xact(entry, pgfdw_get_prep_xact_id(entry));
+					else
+					{
+						do_sql_command(entry->conn, "COMMIT TRANSACTION");
+						entry->xact_depth = 0;
+					}
 
 					/*
 					 * If there were any errors in subtransactions, and we
 					 * made prepared statements, do a DEALLOCATE ALL to make
 					 * sure we get rid of all prepared statements. This is
 					 * annoying and not terribly bulletproof, but it's
 					 * probably not worth trying harder.
 					 *
 					 * DEALLOCATE ALL only exists in 8.3 and later, so this
 					 * constrains how old a server postgres_fdw can
@@ -567,86 +635,107 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 					 */
 					if (entry->have_prep_stmt && entry->have_error)
 					{
 						res = PQexec(entry->conn, "DEALLOCATE ALL");
 						PQclear(res);
 					}
 					entry->have_prep_stmt = false;
 					entry->have_error = false;
 					break;
 				case XACT_EVENT_PRE_PREPARE:
-
 					/*
 					 * We disallow remote transactions that modified anything,
 					 * since it's not very reasonable to hold them open until
 					 * the prepared transaction is committed.  For the moment,
 					 * throw error unconditionally; later we might allow
 					 * read-only cases.  Note that the error will cause us to
 					 * come right back here with event == XACT_EVENT_ABORT, so
 					 * we'll clean up the connection state at that point.
 					 */
 					ereport(ERROR,
 							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 							 errmsg("cannot prepare a transaction that modified remote tables")));
 					break;
 				case XACT_EVENT_COMMIT:
+					/*
+					 * The local transaction has committed. If we prepared a
+					 * transaction on a foreign server, commit it.
+					 */
+					if (entry->prep_xact_name)
+					{
+						resolve_foreign_xact(entry, FDW_XACT_COMMITTING);
+						entry->xact_depth = 0;
+					}
+					else
+						/* Pre-commit should have closed the open transaction */
+						elog(ERROR, "missed cleaning up connection during pre-commit");
+
+					break;
 				case XACT_EVENT_PREPARE:
 					/* Pre-commit should have closed the open transaction */
 					elog(ERROR, "missed cleaning up connection during pre-commit");
 					break;
 				case XACT_EVENT_ABORT:
 					/* Assume we might have lost track of prepared statements */
 					entry->have_error = true;
 					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
+					if (entry->prep_xact_name)
+					{
+						/*
+						 * We have prepared transaction on the remote server,
+						 * roll it back.
+						 */
+						resolve_foreign_xact(entry, FDW_XACT_ABORTING);
+						entry->xact_depth = 0;
+					}
 					else
 					{
+						res = PQexec(entry->conn, "ABORT TRANSACTION");
+						entry->xact_depth = 0;
+						/* Note: can't throw ERROR, it would be infinite loop */
+						if (PQresultStatus(res) != PGRES_COMMAND_OK)
+							pgfdw_report_error(WARNING, res, entry->conn, true,
+											   "ABORT TRANSACTION");
+					}
+					/* As above, make sure to clear any prepared stmts */
+					if (entry->have_prep_stmt && entry->have_error)
+					{
+						res = PQexec(entry->conn, "DEALLOCATE ALL");
 						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
 					}
+					entry->have_prep_stmt = false;
+					entry->have_error = false;
 					break;
 			}
 		}
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
-
 		/*
 		 * If the connection isn't in a good idle state, discard it to
 		 * recover. Next GetConnection will open a new connection.
 		 */
 		if (PQstatus(entry->conn) != CONNECTION_OK ||
 			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
 		{
 			elog(DEBUG3, "discarding connection %p", entry->conn);
 			PQfinish(entry->conn);
 			entry->conn = NULL;
 		}
 	}
 
 	/*
-	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * TODO:
+	 * With 2PC this optimization needs to be revised.
+	 * If we have aborted or committed the transaction, we can now mark ourselves
+	 * as out of the transaction.
 	 */
-	xact_got_connection = false;
+	if (event != XACT_EVENT_PRE_COMMIT)  
+		xact_got_connection = false;
 
 	/* Also reset cursor numbering for next transaction */
 	cursor_number = 0;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
 static void
 pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
@@ -705,10 +794,109 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 			if (PQresultStatus(res) != PGRES_COMMAND_OK)
 				pgfdw_report_error(WARNING, res, entry->conn, true, sql);
 			else
 				PQclear(res);
 		}
 
 		/* OK, we're outta that level of subtransaction */
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * is_server_twophase_compliant
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+is_server_twophase_compliant(ForeignServer *server)
+{
+	/* 
+	 * TODO:
+	 * Do we need to check whether the server passed in is indeed
+	 * PostgreSQL server? Probably not.
+	 */
+	ListCell		*lc;
+	
+	/* Check the options for two phase compliance */ 
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "twophase_compliant") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
+
+/*
+ * prepare_foreign_xact
+ * Prepare the transaction on the foreign server indicated by the entry with
+ * passed in GID.
+ */
+static void
+prepare_foreign_xact(ConnCacheEntry *entry, char *prep_xact_name)
+{
+	StringInfo	command = makeStringInfo();
+	PGresult   *res;
+
+	entry->prep_xact_id = insert_fdw_xact(MyDatabaseId, GetCurrentTransactionId(),
+										entry->key.serverid, entry->key.userid,
+										strlen(prep_xact_name) + 1,
+										prep_xact_name, FDW_XACT_PREPARING);
+	appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_xact_name);
+	res = PQexec(entry->conn, command->data);
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	{
+		/*
+		 * An error occurred, and we didn't prepare the transaction. Delete the
+		 * entry from foreign transaction table. Raise an error, so that the
+		 * local server knows that one of the foreign server has failed to
+		 * prepare the transaction.
+		 */
+		remove_fdw_xact(entry->prep_xact_id);
+		pgfdw_report_error(ERROR, res, entry->conn, true, command->data);
+	}
+
+	PQclear(res);
+	/* Preparation succeeded, remember that we have prepared a transaction */
+	entry->prep_xact_name = prep_xact_name;
+	/* The transaction got prepared, register this fact */
+	update_fdw_xact(entry->prep_xact_id, FDW_XACT_PREPARED);
+}
+
+static void
+resolve_foreign_xact(ConnCacheEntry *entry, int action)
+{
+	StringInfo	command = makeStringInfo();
+	PGresult	*res;
+
+	/* Remember transaction resolution */
+	update_fdw_xact(entry->prep_xact_id, action);
+	if (action == FDW_XACT_COMMITTING)
+		appendStringInfo(command, "COMMIT PREPARED '%s'",
+								entry->prep_xact_name);
+	else if (action == FDW_XACT_ABORTING)
+		appendStringInfo(command, "ROLLBACK PREPARED '%s'",
+								entry->prep_xact_name);
+	else
+		elog(ERROR, "Wrong action %d while resolving foreign transaction", action);
+
+	res = PQexec(entry->conn, command->data);
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	{
+		/*
+		 * The prepared foreign transaction couldn't be resolved. It will be
+		 * resolved later when pg_fdw_resolve() gets called.
+		 */
+	}
+	else
+		/* We succeeded in resolving the transaction, update the information */
+		remove_fdw_xact(entry->prep_xact_id);
+
+	PQclear(res);
+	pfree(entry->prep_xact_name);
+	entry->prep_xact_name = NULL;
+	return;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 583cce7..50b60ea 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1,20 +1,21 @@
 -- ===================================================================
 -- create FDW objects
 -- ===================================================================
 CREATE EXTENSION postgres_fdw;
 CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
 DO $d$
     BEGIN
         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (dbname '$$||current_database()||$$',
-                     port '$$||current_setting('port')||$$'
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
             )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
 CREATE TYPE user_enum AS ENUM ('foo', 'bar', 'buz');
@@ -3249,10 +3250,133 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Next test shows local transaction fails if "any" of the remote transactions
+-- fail to commit. But any COMMITted transaction on the remote servers remains
+-- COMMITTED.
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- The transaction in this test doesn't violate any constraints.
+ALTER SERVER loopback2 OPTIONS (SET twophase_compliant 'false'); 
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (5);
+ERROR:  Detected Two Phase Commit incapable foreign servers in a transaction involving multiple foreign servers.
+COMMIT TRANSACTION;
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..ed956ab 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -98,21 +98,22 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname),
 					 errhint("Valid options in this context are: %s",
 							 buf.data)));
 		}
 
 		/*
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "twophase_compliant") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
 		}
 		else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 				 strcmp(def->defname, "fdw_tuple_cost") == 0)
 		{
 			/* these must have a non-negative numeric value */
 			double		val;
 			char	   *endp;
@@ -146,20 +147,22 @@ InitPgFdwOptions(void)
 		{"column_name", AttributeRelationId, false},
 		/* use_remote_estimate is available on both server and table */
 		{"use_remote_estimate", ForeignServerRelationId, false},
 		{"use_remote_estimate", ForeignTableRelationId, false},
 		/* cost factors */
 		{"fdw_startup_cost", ForeignServerRelationId, false},
 		{"fdw_tuple_cost", ForeignServerRelationId, false},
 		/* updatable is available on both server and table */
 		{"updatable", ForeignServerRelationId, false},
 		{"updatable", ForeignTableRelationId, false},
+		/* 2PC compatibility */
+		{"twophase_compliant", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
 	/* Prevent redundant initialization. */
 	if (postgres_fdw_options)
 		return;
 
 	/*
 	 * Get list of valid libpq options.
 	 *
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d76e739..c2ebeec 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -9,20 +9,21 @@
  *		  contrib/postgres_fdw/postgres_fdw.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/fdw_xact.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -281,20 +282,22 @@ static void postgresExplainForeignScan(ForeignScanState *node,
 static void postgresExplainForeignModify(ModifyTableState *mtstate,
 							 ResultRelInfo *rinfo,
 							 List *fdw_private,
 							 int subplan_index,
 							 ExplainState *es);
 static bool postgresAnalyzeForeignTable(Relation relation,
 							AcquireSampleRowsFunc *func,
 							BlockNumber *totalpages);
 static List *postgresImportForeignSchema(ImportForeignSchemaStmt *stmt,
 							Oid serverOid);
+static bool postgresResolvePreparedTransaction(Oid serveroid, Oid userid, int prep_xact_len,
+									char *prep_xact_name, int xact_resolution);
 
 /*
  * Helper functions
  */
 static void estimate_path_cost_size(PlannerInfo *root,
 						RelOptInfo *baserel,
 						List *join_conds,
 						double *p_rows, int *p_width,
 						Cost *p_startup_cost, Cost *p_total_cost);
 static void get_remote_estimate(const char *sql,
@@ -361,20 +364,22 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for EXPLAIN */
 	routine->ExplainForeignScan = postgresExplainForeignScan;
 	routine->ExplainForeignModify = postgresExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	routine->AnalyzeForeignTable = postgresAnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for resolving transactions */
+	routine->ResolvePreparedTransaction = postgresResolvePreparedTransaction;
 	PG_RETURN_POINTER(routine);
 }
 
 /*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
  * We should consider the effect of all baserestrictinfo clauses here, but
  * not any join clauses.
  */
@@ -912,21 +917,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	/* Get info about foreign table. */
 	fsstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(fsstate->rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(server, user, false);
+	fsstate->conn = GetConnection(server, user, false, true);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
 
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
 	fsstate->retrieved_attrs = (List *) list_nth(fsplan->fdw_private,
 											   FdwScanPrivateRetrievedAttrs);
@@ -1297,21 +1302,21 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	 */
 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
 
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(server, user, true);
+	fmstate->conn = GetConnection(server, user, true, true);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
 	fmstate->query = strVal(list_nth(fdw_private,
 									 FdwModifyPrivateUpdateSql));
 	fmstate->target_attrs = (List *) list_nth(fdw_private,
 											  FdwModifyPrivateTargetAttnums);
 	fmstate->has_returning = intVal(list_nth(fdw_private,
 											 FdwModifyPrivateHasReturning));
 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
@@ -1747,21 +1752,21 @@ estimate_path_cost_size(PlannerInfo *root,
 		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
 						 &retrieved_attrs);
 		if (fpinfo->remote_conds)
 			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
 							  true, NULL);
 		if (remote_join_conds)
 			appendWhereClause(&sql, root, baserel, remote_join_conds,
 							  (fpinfo->remote_conds == NIL), NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		conn = GetConnection(fpinfo->server, fpinfo->user, false, true);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
 
 		retrieved_rows = rows;
 
 		/* Factor in the selectivity of the locally-checked quals */
 		local_sel = clauselist_selectivity(root,
 										   local_join_conds,
 										   baserel->relid,
@@ -2311,21 +2316,21 @@ postgresAnalyzeForeignTable(Relation relation,
 	 * it's probably not worth redefining that API at this point.
 	 */
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true);
 
 	/*
 	 * Construct command to get page count for relation.
 	 */
 	initStringInfo(&sql);
 	deparseAnalyzeSizeSql(&sql, relation);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
 	{
@@ -2403,21 +2408,21 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 											ALLOCSET_SMALL_INITSIZE,
 											ALLOCSET_SMALL_MAXSIZE);
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
 	 */
 	cursor_number = GetCursorNumber(conn);
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
 
 	/* In what follows, do not risk leaking any PGresults. */
@@ -2605,21 +2610,21 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname)));
 	}
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(server, mapping, false);
+	conn = GetConnection(server, mapping, false, true);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
 		import_collate = false;
 
 	/* Create workspace for strings */
 	initStringInfo(&buf);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
@@ -2963,10 +2968,62 @@ static void
 conversion_error_callback(void *arg)
 {
 	ConversionLocation *errpos = (ConversionLocation *) arg;
 	TupleDesc	tupdesc = RelationGetDescr(errpos->rel);
 
 	if (errpos->cur_attno > 0 && errpos->cur_attno <= tupdesc->natts)
 		errcontext("column \"%s\" of foreign table \"%s\"",
 				   NameStr(tupdesc->attrs[errpos->cur_attno - 1]->attname),
 				   RelationGetRelationName(errpos->rel));
 }
+
+/*
+ * postgresResolvePreparedTransaction
+ * Resolve (COMMIT/ABORT) a prepared transaction on the foreign server.
+ * Returns true if resolution is successful, false otherwise.
+ */
+static bool
+postgresResolvePreparedTransaction(Oid serveroid, Oid userid, int prep_xact_len,
+									char *prep_xact_name, int xact_resolution)
+{
+	ForeignServer	*foreign_server = GetForeignServer(serveroid); 
+	UserMapping		*user_mapping = GetUserMapping(userid, serveroid);
+	PGconn			*conn = GetConnection(foreign_server, user_mapping, false, false);
+	StringInfo		command = makeStringInfo();
+	PGresult		*res;
+
+	/*
+	 * TODO: This connection shouldn't be doing any active transaction on the
+	 * foreign server. How do we ensure that?
+	 */
+	if (xact_resolution == FDW_XACT_COMMITTING)
+		appendStringInfo(command, "COMMIT PREPARED '%s'", prep_xact_name); 
+	else if (xact_resolution == FDW_XACT_ABORTING)
+		appendStringInfo(command, "ROLLBACK PREPARED '%s'", prep_xact_name); 
+
+	res = PQexec(conn, command->data);
+	ReleaseConnection(conn);
+	if (PQresultStatus(res) == PGRES_COMMAND_OK) 
+	{
+		PQclear(res);
+		return true;
+	}
+
+	/*
+	 * SQLSTATE 42704 is undefined_object. ERRCODE_UNDEFINED_OBJECT is a packed
+	 * integer, so compare the SQLSTATE text. TODO: use a macro for the code.
+	 */
+	if (PQresultErrorField(res, PG_DIAG_SQLSTATE) != NULL &&
+		strcmp(PQresultErrorField(res, PG_DIAG_SQLSTATE), "42704") == 0)
+	{
+		/*
+		 * The prepared transaction name couldn't be identified, probably
+		 * resolved.
+		 */
+		PQclear(res);
+		return true;
+	}
+
+	/* For anything else, return failed status */
+	PQclear(res);
+	return false;
+}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 950c6f7..f446c90 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -19,21 +19,21 @@
 #include "utils/relcache.h"
 
 #include "libpq-fe.h"
 
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt);
+			  bool will_prep_stmt, bool start_transaction);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
 						 const char **keywords,
 						 const char **values);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 83e8fa7..95b940e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2,21 +2,22 @@
 -- create FDW objects
 -- ===================================================================
 
 CREATE EXTENSION postgres_fdw;
 
 CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
 DO $d$
     BEGIN
         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (dbname '$$||current_database()||$$',
-                     port '$$||current_setting('port')||$$'
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
             )$$;
     END;
 $d$;
 
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -714,10 +715,106 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+-- Next test shows local transaction fails if "any" of the remote transactions
+-- fail to commit. But any COMMITted transaction on the remote servers remains
+-- COMMITTED.
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- The transaction in this test doesn't violate any constraints.
+ALTER SERVER loopback2 OPTIONS (SET twophase_compliant 'false'); 
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (5);
+COMMIT TRANSACTION;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 9d4d5db..b43ca07 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -8,16 +8,17 @@
 #
 #-------------------------------------------------------------------------
 
 subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o commit_ts.o multixact.o rmgr.o slru.o subtrans.o \
 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o \
+	fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
 # ensure that version checks in xlog.c get recompiled when catversion.h changes
 xlog.o: xlog.c $(top_srcdir)/src/include/catalog/catversion.h
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..6394f19
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,464 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module manages the transactions involving foreign servers. 
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "storage/lock.h"
+#include "storage/fd.h"
+#include "storage/procarray.h"
+#include "utils/builtins.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+
+#define FDW_XACT_FILE_NAME	"fdw_xact"
+
+typedef struct
+{
+	bool		read_from_disk;
+	/*
+	 * TODO:
+	 * We should augment this structure with CRC checksum to validate the
+	 * contents of the file.
+	 */
+	FdwXactData fdw_xacts[1];	/* Variable length array */
+} FdwXactGlobalData;
+
+/*
+ * TODO:
+ * This should be turned into a GUC. If we do so, the size also needs to be
+ * recorded in the file so that a change in the GUC value can be noticed.
+ */
+int	max_fdw_xacts = 100;
+
+static FdwXactGlobalData	*FdwXactGlobal;
+
+static int search_free_fdwxact();
+
+/* TODO: we should do something better to reduce lock conflicts */
+/*
+ * Initialization of shared memory
+ */
+extern Size
+FdwXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */ 
+	size = offsetof(FdwXactGlobalData, fdw_xacts);
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FdwXactData)));
+
+	return size;
+}
+
+extern void
+FdwXactShmemInit(void)
+{
+	bool		found;
+
+	FdwXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FdwXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FdwXactGlobal->read_from_disk = true;
+		Assert(!found);
+	}
+	else
+	{
+		Assert(FdwXactGlobal);
+		Assert(found);
+	}
+}
+
+/* 
+ * read_fdw_xact_file
+ * Read the information about the prepared foreign transactions from the disk.
+ * The in-memory copy of the file is always up to date, except before the first read
+ * from the disk after server boot. If we read the file at the time of
+ * initialising the memory, we can be always sure that the in-memory copy of the
+ * foreign transaction information is the latest one. This won't need any
+ * special status for unread file.
+ */
+static void
+read_fdw_xact_file()
+{
+	/*
+	 * If we haven't read the file containing information about FDW
+	 * transactions, read it.
+	 */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	if (FdwXactGlobal->read_from_disk)
+	{
+		int	fd = OpenTransientFile(FDW_XACT_FILE_NAME, O_EXCL | PG_BINARY | O_RDONLY, 0);
+		int	read_size;
+
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not open FDW transaction state file \"%s\": %m",
+						FDW_XACT_FILE_NAME)));
+
+		/* TODO:
+		 * If we turn the max number of foreign transactions into a GUC, we will
+		 * need to make sure that the changed value has enough space to load the
+		 * file.
+		 */
+		read_size = read(fd, FdwXactGlobal, FdwXactShmemSize());
+		CloseTransientFile(fd);
+		
+		/* If there is nothing in the file, 0 out the memory */
+		if (read_size == 0)
+			memset(FdwXactGlobal, 0, FdwXactShmemSize());
+
+		FdwXactGlobal->read_from_disk = false;
+	}
+	LWLockRelease(FdwXactLock);
+	return;
+}
+
+static void
+flush_fdw_xact_file()
+{
+	int	fd;
+	fd = OpenTransientFile(FDW_XACT_FILE_NAME, O_EXCL | PG_BINARY | O_WRONLY, 0);
+	/*
+	 * TODO: We might come out of this without writing, in case the process is
+	 * interrupted. Take care of this case; check for EINTR error value.
+	 * Do we need to fsync this information or the subsequent close would do
+	 * that?
+	 */
+	if (write(fd, FdwXactGlobal, FdwXactShmemSize()) != FdwXactShmemSize())
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write FDW transaction state file \"%s\": %m", FDW_XACT_FILE_NAME)));
+	}
+	CloseTransientFile(fd);
+	return;
+}
+
+/*
+ * search_free_fdwxact
+ * Search for a free slot in FdwXactGlobal array
+ * The caller is expected to hold the lock on this array.
+ */
+static int 
+search_free_fdwxact()
+{
+	int	ret_id;
+	for (ret_id = 0; ret_id < max_fdw_xacts; ret_id++)
+	{
+		FdwXactData *fdw_xact = &(FdwXactGlobal->fdw_xacts[ret_id]);
+		/* A slot with invalid DBID is considered as free slot */
+		if (!OidIsValid(fdw_xact->dboid))
+			return ret_id;	
+	}
+	/* If we reached here, every slot is filled, throw an error */
+	elog(ERROR, "Limit of foreign prepared transactions exceeded");
+	/* Keep the compiler happy */
+	return -1;
+}
+
+extern int 
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id, int fdw_xact_status)
+{
+	int	ret_id;
+	FdwXactData	*fdw_xact;
+
+	read_fdw_xact_file();
+	
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	ret_id = search_free_fdwxact();
+
+	fdw_xact = &(FdwXactGlobal->fdw_xacts[ret_id]);
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serveroid = foreign_server;
+	fdw_xact->userid = userid;
+	fdw_xact->prep_name_len = fdw_xact_id_len;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	memcpy(fdw_xact->prep_xact_info, fdw_xact_id, fdw_xact->prep_name_len);
+	/*
+	 * Write the file to the disk, so that the fact that we performed some
+	 * transaction on this server survives a subsequent crash.
+	 */
+	flush_fdw_xact_file();
+	LWLockRelease(FdwXactLock);
+	return ret_id;
+}
+
+extern int
+update_fdw_xact(int fdw_xact_id, int fdw_xact_status)
+{
+	FdwXactData 	*fdw_xact;
+	read_fdw_xact_file();
+	/* TODO: Validate fdw_xact_id? */
+	/* TODO: Do we need to take a lock here? Probably not. The only process that
+	 * updates this entry is the one which created it or, later, the one which
+	 * resolves it. Those two can not run in parallel.
+	 */
+	fdw_xact = &(FdwXactGlobal->fdw_xacts[fdw_xact_id]);
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	/*
+	 * We don't need to flush the file on every update. Status of local
+	 * transaction is enough to infer what should happen to the foreign
+	 * transaction.
+	 */
+	return fdw_xact_id;
+}
+
+extern void 
+remove_fdw_xact(int fdw_xact_id)
+{
+	FdwXactData 	*fdw_xact;
+	/* Read if the file has not already been read */
+	read_fdw_xact_file();
+	/* TODO: Validate fdw_xact_id? */
+	/* The resolver process and the process which created this entry can both
+	 * try to delete it simultaneously, or a search for a free slot can read the
+	 * entry while it is being deleted and grab it prematurely; hence the lock.
+	 */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = &(FdwXactGlobal->fdw_xacts[fdw_xact_id]);
+	fdw_xact->dboid = InvalidOid;
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FdwXactGlobalData struct definition.
+ *
+ * TODO:
+ * Like pg_prepared_xact() we should create a working set to take care of
+ * synchronization. 
+ */
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	int				cur;
+
+	read_fdw_xact_file();
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_prepared_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "database oid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "local transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "foreign server oid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "user mapping oid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   INT8OID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "info",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/* We always start from the first entry */
+		funcctx->user_fctx = (void *) 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	cur = (int) (intptr_t) funcctx->user_fctx;
+
+	/*
+	 * TODO:
+	 * We should really take lock here OR take a snapshot of foreign
+	 * transactions data.
+	 */
+	while (cur < max_fdw_xacts)
+	{
+		FdwXactData	*fdw_xact = &(FdwXactGlobal->fdw_xacts[cur]);
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		/* Skip the empty slots */
+		if (!OidIsValid(fdw_xact->dboid))
+		{
+			cur++;
+			continue;
+		}
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		/* TODO: this should be really interpreted by FDW */
+		values[4] = Int64GetDatum((int64) fdw_xact->fdw_xact_status);
+		values[5] = CStringGetTextDatum(fdw_xact->prep_xact_info);
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		cur++;
+		funcctx->user_fctx = (void *) (intptr_t) cur;
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	int	cnt_xact;
+	read_fdw_xact_file();
+	/* Scan all the foreign transaction */
+	for (cnt_xact = 0; cnt_xact < max_fdw_xacts; cnt_xact++)
+	{
+		ForeignServer		*foreign_server;
+		ForeignDataWrapper	*fdw;
+		FdwRoutine			*fdw_routine;
+		bool				resolved;
+		FdwXactData			*fdw_xact;
+
+		/*
+		 * TODO: we need to make sure that no one is modifying this entry, while
+		 * we try to resolve it. Following sequence might lead to inconsistent
+		 * entry
+		 * 1. Backend which created this entry marked it as COMMITTING or
+		 * ABORTING
+		 * 2. this function chose to resolve the same entry in some other
+		 * backend (there can be multiple such invocations if we keep it in the
+		 * form of a built-in).
+		 *
+		 * That's a race condition. But possibly there is no hazard except the
+		 * extra messages to the foreign server, as explained below.
+		 * Both of these backends will send corresponding messages to the
+		 * foreign server, one succeeding and other getting an error. Eventual
+		 * result will be that the entry will be removed by one and other will
+		 * leave it untouched. 
+		 */
+		fdw_xact = &(FdwXactGlobal->fdw_xacts[cnt_xact]);
+		
+		/*
+		 * Leave empty slots aside. Leave the entries, whose foreign servers are
+		 * not part of the database where this function was called. We can not
+		 * get information about such foreign servers.
+		 */
+		if (!OidIsValid(fdw_xact->dboid) ||
+				fdw_xact->dboid != MyDatabaseId)
+			continue;
+
+		else if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING ||
+				fdw_xact->fdw_xact_status == FDW_XACT_ABORTING)
+		{
+			/*
+			 * The fate of the foreign transaction has already been decided, so
+			 * no status update is needed; fall through to resolve it.
+			 */
+		}
+		else if (TransactionIdDidCommit(fdw_xact->local_xid))
+		{
+			/* TODO: we should inform the user through warning or error, and let
+			 * him deal when Assert conditions fail.
+			 */
+			Assert(fdw_xact->fdw_xact_status != FDW_XACT_ABORTING);
+			update_fdw_xact(cnt_xact, FDW_XACT_COMMITTING);
+		}
+		else if (TransactionIdDidAbort(fdw_xact->local_xid))
+		{
+			/* TODO: we should inform the user through warning or error, and let
+			 * him deal when Assert conditions fail.
+			 */
+			Assert(fdw_xact->fdw_xact_status != FDW_XACT_COMMITTING);
+			update_fdw_xact(cnt_xact, FDW_XACT_ABORTING);
+		}
+		else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+		{
+			/*
+			 * The transaction has neither committed nor aborted, and it is not
+			 * running on any backend. So probably the backend crashed before the
+			 * actual abort or commit; assume the transaction to be aborted.
+			 * TODO: In HeapTupleSatisfiesUpdate() any transaction which is not
+			 * TransactionIdIsInProgress() and TransactionIdDidCommit() is
+			 * considered aborted. Can we do the same here?
+			 */
+			/* TODO: we should inform the user through warning or error, and let
+			 * him deal when Assert conditions fail.
+			 */
+			Assert(fdw_xact->fdw_xact_status != FDW_XACT_COMMITTING);
+			update_fdw_xact(cnt_xact, FDW_XACT_ABORTING);
+		}
+		else
+		{
+			/* The local transaction is still in progress */
+			continue;
+		}
+
+		/* Onwards we deal only with resolvable transactions */
+		/* Get the FDW hook to resolve the transaction */
+		foreign_server = GetForeignServer(fdw_xact->serveroid); 
+		fdw = GetForeignDataWrapper(foreign_server->fdwid);
+		fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+		Assert(fdw_routine->ResolvePreparedTransaction);
+		resolved = fdw_routine->ResolvePreparedTransaction(fdw_xact->serveroid,
+												fdw_xact->userid,
+												fdw_xact->prep_name_len,
+												fdw_xact->prep_xact_info,
+												fdw_xact->fdw_xact_status);
+		
+		/* If we succeeded in resolving the transaction, remove the entry */
+		if (resolved)
+			remove_fdw_xact(cnt_xact);
+	}
+	/* Flush the foreign transaction updates to the disk */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	flush_fdw_xact_file();
+	LWLockRelease(FdwXactLock);
+	PG_RETURN_VOID();
+}
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 97000ef..76c16d6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -146,20 +146,24 @@ typedef struct TransactionStateData
 	ResourceOwner curTransactionOwner;	/* my query resources */
 	TransactionId *childXids;	/* subcommitted child XIDs, in XID order */
 	int			nChildXids;		/* # of subcommitted child XIDs */
 	int			maxChildXids;	/* allocated size of childXids[] */
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
 	bool		startedInRecovery;		/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the
+										   transaction; only valid for top-level transaction */
+	bool		can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
 
 /*
  * CurrentTransactionState always points to the current transaction state
  * block.  It will point to TopTransactionStateData when not in a
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
@@ -175,21 +179,23 @@ static TransactionStateData TopTransactionStateData = {
 	NULL,						/* cur transaction context */
 	NULL,						/* cur transaction resource owner */
 	NULL,						/* subcommitted child Xids */
 	0,							/* # of subcommitted child Xids */
 	0,							/* allocated size of childXids[] */
 	InvalidOid,					/* previous CurrentUserId setting */
 	0,							/* previous SecurityRestrictionContext */
 	false,						/* entry-time xact r/o state */
 	false,						/* startedInRecovery */
 	false,						/* didLogXid */
-	NULL						/* link to parent state block */
+	NULL,						/* link to parent state block */
+	0,							/* number of foreign servers participating in the transaction */
+	true						/* By default all the foreign servers are capable of 2PC */
 };
 
 /*
  * unreportedXids holds XIDs of all subtransactions that have not yet been
  * reported in a XLOG_XACT_ASSIGNMENT record.
  */
 static int	nUnreportedXids;
 static TransactionId unreportedXids[PGPROC_MAX_CACHED_SUBXIDS];
 
 static TransactionState CurrentTransactionState = &TopTransactionStateData;
@@ -1807,20 +1813,23 @@ StartTransaction(void)
 	/* SecurityRestrictionContext should never be set outside a transaction */
 	Assert(s->prevSecContext == 0);
 
 	/*
 	 * initialize other subsystems for new transaction
 	 */
 	AtStart_GUC();
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
 	 */
 	s->state = TRANS_INPROGRESS;
 
 	ShowTransactionState("StartTransaction");
 }
 
 
@@ -4971,10 +4980,39 @@ xact_redo(XLogReaderState *record)
 	{
 		xl_xact_assignment *xlrec = (xl_xact_assignment *) XLogRecGetData(record);
 
 		if (standbyState >= STANDBY_INITIALIZED)
 			ProcArrayApplyXidAssignment(xlrec->xtop,
 										xlrec->nsubxacts, xlrec->xsub);
 	}
 	else
 		elog(PANIC, "xact_redo: unknown op code %u", info);
 }
+
+void
+RegisterXactForeignServer(Oid serveroid, bool can_prepare)
+{
+	TransactionState	top_xact_state = &TopTransactionStateData;
+	top_xact_state->num_foreign_servers++;
+
+	if (top_xact_state->num_foreign_servers == 1)
+		top_xact_state->can_prepare = can_prepare;
+
+	top_xact_state->can_prepare = top_xact_state->can_prepare && can_prepare;
+
+	/*
+	 * If multiple foreign servers are involved in the transaction and any one
+	 * of them is not capable of 2PC, atomic commit is impossible: raise an error.
+	 */
+	if (top_xact_state->num_foreign_servers > 1 &&
+		!top_xact_state->can_prepare)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("Detected Two Phase Commit incapable foreign servers in a transaction involving multiple foreign servers.")));
+}
+
+bool
+FdwTwoPhaseNeeded(void)
+{
+	TransactionState	top_xact_state = &TopTransactionStateData;
+	return (top_xact_state->num_foreign_servers > 1);
+}
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 16b9808..9b8893b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -14,20 +14,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/fdw_xact.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
@@ -130,20 +131,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, PMSignalShmemSize());
 		size = add_size(size, ProcSignalShmemSize());
 		size = add_size(size, CheckpointerShmemSize());
 		size = add_size(size, AutoVacuumShmemSize());
 		size = add_size(size, ReplicationSlotsShmemSize());
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FdwXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
 
 		/* freeze the addin request size and include it */
 		addin_request_allowed = false;
 		size = add_size(size, total_addin_request);
 
 		/* might as well round it off to a multiple of a typical page size */
 		size = add_size(size, 8192 - (size % 8192));
@@ -240,20 +242,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationSlotsShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FdwXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
 	/*
 	 * Alloc the win32 shared backend array
 	 */
 	if (!IsUnderPostmaster)
 		ShmemBackendArrayAllocation();
 #endif
 
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 18614e7..53a6047 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -196,21 +196,21 @@ static const char *subdirs[] = {
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
 };
 
 
 /* path to 'initdb' binary directory */
 static char bin_path[MAXPGPATH];
 static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			  const char *token, const char *replacement);
 
@@ -225,20 +225,21 @@ static void pre_sync_fname(char *fname, bool isdir);
 static void fsync_fname(char *fname, bool isdir);
 static FILE *popen_check(const char *command, const char *mode);
 static void exit_nicely(void);
 static char *get_id(void);
 static char *get_encoding_id(char *encoding_name);
 static bool mkdatadir(const char *subdir);
 static void set_input(char **dest, char *filename);
 static void check_input(char *path);
 static void write_version_file(char *extrapath);
 static void set_null_conf(void);
+static void set_fdw_xact_file(void);
 static void test_config_settings(void);
 static void setup_config(void);
 static void bootstrap_template1(void);
 static void setup_auth(void);
 static void get_set_pwd(void);
 static void setup_depend(void);
 static void setup_sysviews(void);
 static void setup_description(void);
 static void setup_collation(void);
 static void setup_conversion(void);
@@ -1088,20 +1089,46 @@ set_null_conf(void)
 	if (fclose(conf_file))
 	{
 		fprintf(stderr, _("%s: could not write file \"%s\": %s\n"),
 				progname, path, strerror(errno));
 		exit_nicely();
 	}
 	free(path);
 }
 
 /*
+ * set up an empty fdw_xact file in the data directory; the server uses it
+ * to persist foreign transaction entries
+ */
+static void
+set_fdw_xact_file(void)
+{
+	FILE	   *conf_file;
+	char	   *path;
+
+	path = psprintf("%s/fdw_xact", pg_data);
+	conf_file = fopen(path, PG_BINARY_W);
+	if (conf_file == NULL)
+	{
+		fprintf(stderr, _("%s: could not open file \"%s\" for writing: %s\n"),
+				progname, path, strerror(errno));
+		exit_nicely();
+	}
+	if (fclose(conf_file))
+	{
+		fprintf(stderr, _("%s: could not write file \"%s\": %s\n"),
+				progname, path, strerror(errno));
+		exit_nicely();
+	}
+	free(path);
+}
+/*
  * Determine which dynamic shared memory implementation should be used on
  * this platform.  POSIX shared memory is preferable because the default
  * allocation limits are much higher than the limits for System V on most
  * systems that support both, but the fact that a platform has shm_open
  * doesn't guarantee that that call will succeed when attempted.  So, we
  * attempt to reproduce what the postmaster will do when allocating a POSIX
  * segment in dsm_impl.c; if it doesn't work, we assume it won't work for
  * the postmaster either, and configure the cluster for System V shared
  * memory instead.
  */
@@ -3497,20 +3524,21 @@ initialize_data_directory(void)
 	}
 
 	check_ok();
 
 	/* Top level PG_VERSION is checked by bootstrapper, so make it first */
 	write_version_file(NULL);
 
 	/* Select suitable configuration settings */
 	set_null_conf();
 	test_config_settings();
+	set_fdw_xact_file();
 
 	/* Now create all the text config files */
 	setup_config();
 
 	/* Bootstrap template1 */
 	bootstrap_template1();
 
 	/*
 	 * Make the per-database PG_VERSION for template1 only after init'ing it
 	 */
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..be92eed
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,51 @@
+/*
+ * fdw_xact.h 
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H 
+#define FDW_XACT_H 
+
+#define MAX_PREP_XACT_INFO_LEN	256
+typedef struct
+{
+	Oid		dboid;				/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid		serveroid;			/* foreign server where transaction takes place */
+	Oid		userid;				/* user who initiated the foreign transaction */
+	int		fdw_xact_status;	/* The state of the foreign transaction */
+	uint8	prep_name_len;		/* Length of the prepared transaction data */ 
+	/* 
+	 * TODO: Restricting the size of prepared transaction information may not
+	 * suit all FDWs. One possible idea is to create a file for every foreign
+	 * transaction entry to contain the prepared transaction information
+	 * required by FDW. While resolving the transaction, the information from
+	 * the file would be passed to the FDW routine.
+	 */
+	char	prep_xact_info[MAX_PREP_XACT_INFO_LEN];	/* The prepared transaction data starts here */
+} FdwXactData;
+
+typedef enum
+{
+	FDW_XACT_UNKNOWN = 0,
+	FDW_XACT_PREPARING,
+	FDW_XACT_PREPARED,
+	FDW_XACT_COMMITTING,
+	FDW_XACT_ABORTING
+} FdwXactStatus;
+
+extern Size FdwXactShmemSize(void);
+extern void FdwXactShmemInit(void);
+extern int insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid user_mapping,
+							int fdw_xact_id_len, char *fdw_xact_id, int fdw_xact_status);
+extern int update_fdw_xact(int fdw_xact_id, int fdw_xact_status);
+extern void remove_fdw_xact(int fdw_xact_id);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 8205504..2adad46 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -252,12 +252,14 @@ extern bool IsInTransactionChain(bool isTopLevel);
 extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
 extern void xact_redo(XLogReaderState *record);
 extern void xact_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xact_identify(uint8 info);
+extern void RegisterXactForeignServer(Oid serveroid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
 
 #endif   /* XACT_H */
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 9edfdb8..a00afdb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5135,20 +5135,24 @@ DESCR("fractional rank of hypothetical row");
 DATA(insert OID = 3989 ( percent_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ hypothetical_percent_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3990 ( cume_dist			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 701 "2276" "{2276}" "{v}" _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("cumulative distribution of hypothetical row");
 DATA(insert OID = 3991 ( cume_dist_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ hypothetical_cume_dist_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 20 "2276" "{2276}" "{v}" _null_ _null_	aggregate_dummy _null_ _null_ _null_ ));
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_	hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4063 (  pg_fdw_xact PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,20,25}" "{o,o,o,o,o,o}" "{database oid, local transaction,foreign server oid,user mapping oid,status,info}" _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4071 (  pg_fdw_resolve PGNSP PGUID 12 1 1000 0 0 f f f f f f v 0 0 2278 "" _null_ _null_ _null_  _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
 
 
 /*
  * Symbolic values for provolatile column: these indicate whether the result
  * of a function is dependent *only* on the values of its explicit arguments,
  * or can change due to outside factors (such as parameter variables or
  * table contents).  NOTE: functions having side-effects, such as setval(),
  * must be labeled volatile to ensure they will not get optimized away,
  * even if the actual return value is not changeable.
  */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 1d76841..6211738 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -95,20 +95,23 @@ typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 											   HeapTuple *rows, int targrows,
 												  double *totalrows,
 												  double *totaldeadrows);
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 												 AcquireSampleRowsFunc *func,
 													BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
+typedef bool (*ResolvePreparedTransaction_function) (Oid serverOid, Oid user_mapping,
+														int prep_info_len,
+														char *prep_info, int resolution);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
  * planner and executor.
  *
  * More function pointers are likely to be added in the future.  Therefore
  * it's recommended that the handler initialize the struct with
  * makeNode(FdwRoutine) so that all fields are set to NULL.  This will
  * ensure that no fields are accidentally left undefined.
@@ -143,20 +146,23 @@ typedef struct FdwRoutine
 
 	/* Support functions for EXPLAIN */
 	ExplainForeignScan_function ExplainForeignScan;
 	ExplainForeignModify_function ExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	AnalyzeForeignTable_function AnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
+
+	/* Support functions for prepared transaction resolution */
+	ResolvePreparedTransaction_function ResolvePreparedTransaction;
 } FdwRoutine;
 
 
 /* Functions in foreign/foreign.c */
 extern FdwRoutine *GetFdwRoutine(Oid fdwhandler);
 extern FdwRoutine *GetFdwRoutineByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineForRelation(Relation relation, bool makecopy);
 extern bool IsImportableForeignTable(const char *tablename,
 						 ImportForeignSchemaStmt *stmt);
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e3c2efc..da056b6 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -127,22 +127,23 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define SerializablePredicateLockListLock	(&MainLWLockArray[30].lock)
 #define OldSerXidLock				(&MainLWLockArray[31].lock)
 #define SyncRepLock					(&MainLWLockArray[32].lock)
 #define BackgroundWorkerLock		(&MainLWLockArray[33].lock)
 #define DynamicSharedMemoryControlLock		(&MainLWLockArray[34].lock)
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
+#define FdwXactLock					(&MainLWLockArray[40].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		40
+#define NUM_INDIVIDUAL_LWLOCKS		41
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
  * here, but we need them to figure out offsets within MainLWLockArray, and
  * having this file include lock.h or bufmgr.h would be backwards.
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
 
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index bc4517d..4783c2b 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1216,11 +1216,14 @@ extern Datum pg_available_extensions(PG_FUNCTION_ARGS);
 extern Datum pg_available_extension_versions(PG_FUNCTION_ARGS);
 extern Datum pg_extension_update_paths(PG_FUNCTION_ARGS);
 extern Datum pg_extension_config_dump(PG_FUNCTION_ARGS);
 
 /* commands/prepare.c */
 extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */

#27

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 11 years ago

In reply to: Ashutosh Bapat (#26)

Re: Transactions involving multiple postgres foreign servers

Added to 2015-06 commitfest to attract some reviews and comments.

On Tue, Feb 17, 2015 at 2:56 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

Hi All,

Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised earlier in this mail thread. Let me know if I have left something out.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to make it FDW-independent.

A. Steps during transaction processing
------------------------------------------------

1. When an FDW connects to a foreign server and starts a transaction, it
registers that server with a boolean flag indicating whether that server is
capable of participating in a two phase commit. In the patch this is
implemented using function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if there is at least one foreign
server incapable of 2PC in a multiserver transaction. This error is thrown as
early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just updates the information. As of now,
in the patch the function is in the form of a stub.

Whether a foreign server is capable of 2PC can be
a. an FDW-level decision, e.g. file_fdw as of now is incapable of 2PC, but it
can build the capabilities, which can then be used for all the servers using
file_fdw
b. a decision based on server version, type, etc.; the FDW can decide that
by looking at the server properties for each server
c. a user decision, where the FDW allows a user to specify it in the
form of a CREATE/ALTER SERVER option. Implemented in the patch.

For a transaction involving only a single foreign server, the current code
remains unaltered as two-phase commit is not needed. The rest of the discussion
pertains to a transaction involving more than one foreign server.
At commit or abort time, the FDW receives callbacks with the
appropriate events. The FDW then takes the following actions on each event.

2. On the XACT_EVENT_PRE_COMMIT event, the FDW coins one prepared transaction
id per foreign server involved and saves it in memory, along with the xid,
dbid, foreign server id, user mapping and foreign transaction status =
PREPARING. The prepared transaction id can be anything represented as a byte
string. The same information is flushed to the disk to survive crashes. This is
implemented in the patch as prepare_foreign_xact(). Persistent and
in-memory storage and their usage are discussed later in the mail. The FDW
then prepares the transaction on the foreign server. If this step is
successful, the foreign transaction status is changed to PREPARED. If the
step is unsuccessful, the local transaction is aborted and each FDW will
receive XACT_EVENT_ABORT (discussed later). The updates to the foreign
transaction status need not be flushed to the disk, as they can be inferred
from the status of the local transaction.

3. If the local transaction is committed, the FDW callback will get the
XACT_EVENT_COMMIT event. The foreign transaction status is changed to
COMMITTING. The FDW tries to commit the foreign transaction with the prepared
transaction id. If the commit is successful, the foreign transaction entry
is removed. If the commit is unsuccessful because of a local/foreign server
crash or network failure, the foreign prepared transaction resolver takes
care of committing it at a later point in time.

4. If the local transaction is aborted, the FDW callback will get the
XACT_EVENT_ABORT event. At this point, the FDW may or may not have prepared
a transaction on the foreign server as per step 2 above. If it has not prepared
the transaction, it simply aborts the transaction on the foreign server; a
server crash or network failure doesn't alter the ultimate result in this
case. If the FDW has prepared the foreign transaction, it updates the foreign
transaction status to ABORTING and tries to roll back the prepared
transaction. If the rollback is successful, the foreign transaction entry
is removed. If the rollback is not successful, the foreign prepared
transaction resolver takes care of aborting it at a later point in time.
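
The status changes described in steps 2 to 4 can be sketched as a small transition function. This is only an illustration, not code from the patch: the FdwXactStatus values mirror the patch's fdw_xact.h, while SketchEvent, its SK_* members and sketch_next_status() are hypothetical names. It also assumes that an entry still in PREPARING moves to ABORTING on abort, which the mail does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Status values as declared in the patch's fdw_xact.h. */
typedef enum
{
	FDW_XACT_UNKNOWN = 0,
	FDW_XACT_PREPARING,
	FDW_XACT_PREPARED,
	FDW_XACT_COMMITTING,
	FDW_XACT_ABORTING
} FdwXactStatus;

/* Hypothetical event names, used by this sketch only. */
typedef enum
{
	SK_PRE_COMMIT,		/* XACT_EVENT_PRE_COMMIT: entry gets registered */
	SK_PREPARE_OK,		/* prepare on the foreign server succeeded */
	SK_COMMIT,			/* XACT_EVENT_COMMIT */
	SK_ABORT			/* XACT_EVENT_ABORT */
} SketchEvent;

/*
 * Compute the status an entry moves to on an event. Returns false, leaving
 * *next untouched, when the event is not expected in the current status.
 */
static bool
sketch_next_status(FdwXactStatus cur, SketchEvent ev, FdwXactStatus *next)
{
	assert(next != NULL);

	switch (ev)
	{
		case SK_PRE_COMMIT:
			if (cur != FDW_XACT_UNKNOWN)
				return false;
			*next = FDW_XACT_PREPARING;
			return true;
		case SK_PREPARE_OK:
			if (cur != FDW_XACT_PREPARING)
				return false;
			*next = FDW_XACT_PREPARED;
			return true;
		case SK_COMMIT:
			/* The local commit happens only after every prepare succeeded. */
			if (cur != FDW_XACT_PREPARED)
				return false;
			*next = FDW_XACT_COMMITTING;
			return true;
		case SK_ABORT:
			/* Assumption: a registered but unconfirmed prepare is aborted too. */
			if (cur != FDW_XACT_PREPARING && cur != FDW_XACT_PREPARED)
				return false;
			*next = FDW_XACT_ABORTING;
			return true;
	}
	return false;
}
```

An entry that ends up in COMMITTING or ABORTING is removed once the foreign server confirms the action; otherwise it is left for the resolver.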

B. Foreign prepared transaction resolver
---------------------------------------------------
In the patch this is implemented as a built-in function pg_fdw_resolve().
Ideally, this functionality should be run periodically by a background
worker process.

The resolver looks at each entry and invokes the FDW routine to resolve
the transaction. The FDW routine returns boolean status: true if the
prepared transaction was resolved (committed/aborted), false otherwise.
The resolution is as follows -
1. If foreign transaction status is COMMITTING or ABORTING, commits or
aborts the prepared transaction resp through the FDW routine. If the
transaction is successfully resolved, it removes the foreign transaction
entry.
2. Else, it checks whether the local transaction was committed or aborted,
updates the foreign transaction status accordingly and takes the action
according to step 1 above.
3. The resolver doesn't touch entries created by in-progress local
transactions.
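
The three rules above can be read as one decision function. The following is a minimal sketch under stated assumptions, not the patch's code: SketchAction and sketch_resolve_action() are hypothetical names, and the two boolean arguments stand in for what the patch obtains from the local transaction status (for example via TransactionIdIsInProgress()). A local transaction that is neither in progress nor committed is treated as aborted.

```c
#include <assert.h>
#include <stdbool.h>

/* Status values as declared in the patch's fdw_xact.h. */
typedef enum
{
	FDW_XACT_UNKNOWN = 0,
	FDW_XACT_PREPARING,
	FDW_XACT_PREPARED,
	FDW_XACT_COMMITTING,
	FDW_XACT_ABORTING
} FdwXactStatus;

/* Hypothetical result type: what the resolver asks the FDW to do. */
typedef enum
{
	SK_SKIP,				/* leave the entry alone for now */
	SK_COMMIT_PREPARED,		/* commit the prepared foreign transaction */
	SK_ROLLBACK_PREPARED	/* roll back the prepared foreign transaction */
} SketchAction;

static SketchAction
sketch_resolve_action(FdwXactStatus status,
					  bool local_in_progress,
					  bool local_committed)
{
	/* A transaction cannot be both running and committed. */
	assert(!(local_in_progress && local_committed));

	/* Rule 1: the decision is already recorded in the entry. */
	if (status == FDW_XACT_COMMITTING)
		return SK_COMMIT_PREPARED;
	if (status == FDW_XACT_ABORTING)
		return SK_ROLLBACK_PREPARED;

	/* Rule 3: never touch entries of in-progress local transactions. */
	if (local_in_progress)
		return SK_SKIP;

	/* Rule 2: follow the fate of the local transaction. */
	return local_committed ? SK_COMMIT_PREPARED : SK_ROLLBACK_PREPARED;
}
```

After the FDW routine reports success for the chosen action, the entry is removed; on failure it stays and is retried on the next resolver run.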

If the server/backend crashes after it has registered the foreign transaction
entry (during step A.2) but before the transaction is prepared on the foreign
server, we will be left with a prepared transaction id which was never
prepared on the foreign server. Similarly, if the server/backend crashes
after it has resolved the foreign prepared transaction but before removing
the entry, the same situation can arise. The FDW should detect these
situations when the foreign server complains about non-existent prepared
transaction ids, and consider such foreign transactions as resolved.

After looking at all the entries the resolver flushes the entries to the
disk, so as to retain the latest status across shutdown and crash.

C. Other methods and infrastructure
------------------------------------------------
1. Method to show the current foreign transaction entries (in progress or
waiting to be resolved). Implemented as function pg_fdw_xact() in the patch.
2. Method to drop foreign transaction entries in case they are resolved by
the user/DBA themselves. Not implemented in the patch.
3. Method to prevent altering or dropping the foreign server and user mapping
used to prepare the foreign transaction until the latter gets resolved. Not
implemented in the patch. While altering or dropping the foreign server or
user mapping, that portion of the code needs to check whether there exists a
foreign transaction entry depending on the foreign server or user mapping,
and should error out if so.
4. The information about the xid needs to be available till it is decided
whether to commit or abort the foreign transaction and that decision is
persisted. That should put some constraint on the xid wraparound or oldest
active transaction. Not implemented in the patch.
5. Method to propagate the foreign transaction information to the slave.

D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to file whenever
persistence is necessary e.g. while registering the foreign transaction in
step A.2. Requirements C.1, C.2 need some SQL interface in the form of
built-in functions or SQL commands.

The patch implements the in-memory foreign transaction table as a fixed
size array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on number of foreign
prepared transactions that need to be maintained at a time. We need
separate locks to synchronize access to the shared memory; the patch
uses only a single LW lock. There is restriction on the length of prepared
transaction id (or prepared transaction information saved by FDW to be
general), since everything is being saved in fixed size memory. We may be
able to overcome that restriction by writing this information to separate
files (one file per foreign prepared transaction). We need to take the same
route as 2PC for C.5.
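
To make the two restrictions concrete (a bounded number of entries and a bounded information length), here is a minimal sketch of a fixed-size table. It is not the patch's data structure: SketchEntry, SketchTable, sketch_insert() and sketch_remove() are hypothetical names, the capacity of 4 is arbitrary, locking is omitted, and only SK_MAX_INFO_LEN mirrors the patch's MAX_PREP_XACT_INFO_LEN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX_ENTRIES	4		/* hypothetical capacity of the table */
#define SK_MAX_INFO_LEN	256		/* mirrors MAX_PREP_XACT_INFO_LEN */

typedef struct
{
	bool		in_use;
	unsigned int local_xid;		/* stand-in for TransactionId */
	size_t		info_len;
	char		info[SK_MAX_INFO_LEN];
} SketchEntry;

typedef struct
{
	SketchEntry entries[SK_MAX_ENTRIES];
} SketchTable;

/*
 * Store an entry in the first free slot. Returns the slot number, or -1 when
 * the table is full or the information does not fit in the fixed buffer.
 */
static int
sketch_insert(SketchTable *tab, unsigned int xid, const char *info, size_t len)
{
	int			i;

	assert(tab != NULL && info != NULL);
	if (len > SK_MAX_INFO_LEN)
		return -1;

	for (i = 0; i < SK_MAX_ENTRIES; i++)
	{
		SketchEntry *e = &tab->entries[i];

		if (!e->in_use)
		{
			e->in_use = true;
			e->local_xid = xid;
			e->info_len = len;
			memcpy(e->info, info, len);
			return i;
		}
	}
	return -1;
}

/* Free a slot so that a later insert can reuse it. */
static void
sketch_remove(SketchTable *tab, int slot)
{
	assert(tab != NULL && slot >= 0 && slot < SK_MAX_ENTRIES);
	tab->entries[slot].in_use = false;
}
```

Both failure cases of sketch_insert() correspond to the limits mentioned above; storing the information in one file per foreign prepared transaction would lift the second one.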

2. New catalog - This method removes the need for separate methods
for C.1, C.5 and even C.2. Synchronization will be taken care of by
row locks, and there will be no limit on the number of foreign transactions or
on the size of foreign prepared transaction information. But the big
problem with this approach is that the changes to the catalogs are atomic
with the local transaction. If a foreign prepared transaction cannot be
aborted while the local transaction is rolled back, that entry needs to be
retained. But since the local transaction is aborting, the corresponding
catalog entry would become invisible and thus unavailable to the resolver
(alas! we do not have autonomous transaction support). We may be able to
overcome this by simulating an autonomous transaction through a background
worker (which can also act as a resolver). But the amount of communication
and synchronization might affect performance.

A mixed approach where the backend shifts the entries from storage in
approach 1 to catalog, thus lifting the constraints on size is possible,
but is very complicated.

Any other ideas to use catalog table as the persistent storage here? Does
anybody think, catalog table is a viable option?

3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records cannot be used for the repeated scans required by the foreign
transaction resolver. Also, replaying WAL is controlled by checkpoints, so
not all WAL records are replayed. If a checkpoint happens while a foreign
prepared transaction remains unresolved, the corresponding WAL records will
never be replayed, thus causing the foreign prepared transaction to remain
unresolved forever without a clue. So, WAL alone doesn't seem to be a fit here.

The algorithm relies on the FDWs to take the right steps to a large extent,
rather than controlling each step explicitly. It expects the FDWs to take
the right steps for each event and call the right functions to manipulate
foreign transaction entries. It does not ensure the correctness of these
steps by, say, examining the foreign transaction entries in response to each
event, or by making the callback return the information and manipulating the
entries within the core. I am willing to go the stricter but more intrusive
route if others also think that way. Otherwise, the current approach is
less intrusive and I am fine with that too.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#28

Heikki Linnakangas

hlinnaka@iki.fi

over 10 years ago

In reply to: Ashutosh Bapat (#26)

Re: Transactions involving multiple postgres foreign servers

On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:

Hi All,

Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised earlier in this mail thread. Let me know if I have left something out.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to make it FDW-independent.

Wow, this is going to be a lot of new infrastructure. This is going to
need good documentation, explaining how two-phase commit works in
general, how it's implemented, how to monitor it etc. It's important to
explain all the possible failure scenarios where you're left with
in-doubt transactions, and how the DBA can resolve them.

Since we're building a Transaction Manager into PostgreSQL, please put a
lot of thought into what kind of APIs it provides to the rest of the
system. APIs for monitoring it, configuring it, etc. And how an
extension could participate in a transaction, without necessarily being
an FDW.

Regarding the configuration, there are many different behaviours that an
FDW could implement:

1. The FDW is read-only. Commit/abort behaviour is moot.
2. Transactions are not supported. All updates happen immediately
regardless of the local transaction.
3. Transactions are supported, but two-phase commit is not. There are
three different ways we can use the remote transactions in that case:
3.1. Commit the remote transaction before local transaction.
3.2. Commit the remote transaction after local transaction.
3.3. As long as there is only one such FDW involved, we can still do
safe two-phase commit using so-called Last Resource Optimization.
4. Full two-phase commit support.

We don't necessarily have to support all of that, but let's keep all
these cases in mind when we design how to configure FDWs. There's
more to it than "does it support 2PC".

A. Steps during transaction processing
------------------------------------------------

1. When an FDW connects to a foreign server and starts a transaction, it
registers that server with a boolean flag indicating whether that server is
capable of participating in a two phase commit. In the patch this is
implemented using function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if there is at least one foreign
server incapable of 2PC in a multiserver transaction. This error is thrown as
early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just updates the information. As of now,
in the patch the function is in the form of a stub.

Whether a foreign server is capable of 2PC, can be
a. FDW level decision e.g. file_fdw as of now, is incapable of 2PC but it
can build the capabilities which can be used for all the servers using
file_fdw
b. a decision based on server version type etc. thus FDW can decide that by
looking at the server properties for each server
c. a user decision where the FDW can allow a user to specify it in the form
of CREATE/ALTER SERVER option. Implemented in the patch.

For a transaction involving only a single foreign server, the current code
remains unaltered as two phase commit is not needed.

Just to be clear: you also need two-phase commit if the transaction
updated anything in the local server and in even one foreign server.

D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to file whenever
persistence is necessary e.g. while registering the foreign transaction in
step A.2. Requirements C.1, C.2 need some SQL interface in the form of
built-in functions or SQL commands.

The patch implements the in-memory foreign transaction table as a fixed
size array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on number of foreign
prepared transactions that need to be maintained at a time. We need
separate locks to synchronize the access to the shared memory; the patch
uses only a single LW lock. There is restriction on the length of prepared
transaction id (or prepared transaction information saved by FDW to be
general), since everything is being saved in fixed size memory. We may be
able to overcome that restriction by writing this information to separate
files (one file per foreign prepared transaction). We need to take the same
route as 2PC for C.5.

Your current approach with a file that's flushed to disk on every update
has a few problems. Firstly, it's not crash safe. Secondly, if you make
it crash-safe with fsync(), performance will suffer. You're going to
need several fsyncs per commit with 2PC anyway, there's no way
around that, but the scalable way to do that is to use the WAL so that
one fsync() can flush more than one update in one operation.

So I think you'll need to do something similar to the pg_twophase files.
WAL-log each update, and only flush the file/files to disk on a
checkpoint. Perhaps you could use the pg_twophase infrastructure for
this directly, by essentially treating every local transaction as a
two-phase transaction, with some extra flag to indicate that it's an
internally-created one.

2. New catalog - This method takes out the need to have separate method for
C1, C5 and even C2, also the synchronization will be taken care of by row
locks, there will be no limit on the number of foreign transactions as well
as the size of foreign prepared transaction information. But big problem
with this approach is that, the changes to the catalogs are atomic with the
local transaction. If a foreign prepared transaction can not be aborted
while the local transaction is rolled back, that entry needs to be retained.
But since the local transaction is aborting the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this,
by simulating autonomous transaction through a background worker (which can
also act as a resolver). But the amount of communication and
synchronization, might affect the performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid,
ignoring MVCC. Not sure how well that would work.

3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records can not be used for repeated scan as is required by the foreign
transaction resolver. Also, replaying WALs is controlled by checkpoint, so
not all WALs are replayed. If a checkpoint happens after a foreign prepared
transaction remains unresolved, corresponding WALs will never be replayed,
thus causing the foreign prepared transaction to remain unresolved forever
without a clue. So, WALs alone don't seem to be a fit here.

Right. The pg_twophase files solve that exact same issue.

There is clearly a lot of work to do here. I'm marking this as Returned
with Feedback in the commitfest, I don't think more review is going to
be helpful at this point.

- Heikki


#29

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 10 years ago

In reply to: Heikki Linnakangas (#28)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

Hi All,
I have been working on improving the previous implementation and addressing
TODOs in my previous mail. Let me explain the approach first and I will get
to Heikki's comments later in the same mail.

The patch provides support for atomic commit for transactions involving
foreign servers. When a transaction makes changes to foreign servers,
either the changes to all the foreign servers commit or they all roll back. We
should not see some changes committed and others rolled back.

Hooks and GUCs
==============
The patch introduces a GUC atomic_foreign_transaction, which when ON
ensures atomic commit for foreign transactions, otherwise not. The value of
this GUC at the time of committing or preparing a local transaction is
used. This gives applications the flexibility to choose the behaviour as
late in the transaction as possible. This GUC has no effect if there are no
foreign servers involved in the transaction.

Another GUC max_fdw_transactions sets the maximum number of transactions
that can be simultaneously prepared on all the foreign servers. This limits
the memory required for remembering the prepared foreign transactions.

Two new FDW hooks are introduced for transaction management.
1. GetPrepareId: to get the prepared transaction identifier for a given
foreign server connection. An FDW which doesn't want to support this
feature can keep this hook undefined (NULL). When defined the hook should
return a unique identifier for the transaction prepared on the foreign
server. The identifier should be unique enough not to conflict with
currently prepared or future transactions. This point will be clear when
discussing phase 2 of 2PC.

2. HandleForeignTransaction: to end a transaction in specified way. The
hook should be able to prepare/commit/rollback current running transaction
on given connection or commit/rollback a previously prepared transaction.
This is described in detail while describing phase two of two-phase commit.
The function is required to return a boolean status of whether the
requested operation was successful or not. The function or its minions
should not raise any error on failure so as not to interfere with the
distributed transaction processing. This point will be clarified more in
the description below.
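
To make the shape of the two hooks concrete, here is a minimal, self-contained sketch. It is not taken from the patch: the type names, the action enum, the argument lists and the "px_<server>_<user>" identifier format are all assumptions made for illustration. It only shows the two properties described above: an FDW opts out by leaving GetPrepareId NULL, and HandleForeignTransaction reports failure through its boolean return value instead of raising an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */

/* Hypothetical action codes; the patch may spell these differently. */
typedef enum
{
    TOY_PREPARE,
    TOY_COMMIT,
    TOY_ABORT,
    TOY_COMMIT_PREPARED,
    TOY_ABORT_PREPARED
} ToyXactAction;

/* Assumed signature: writes the identifier into buf, returns its length. */
typedef int (*GetPrepareId_hook) (Oid serverid, Oid userid,
                                  char *buf, size_t buflen);

/* Assumed signature: returns false on failure, never raises an error. */
typedef bool (*HandleForeignTransaction_hook) (Oid serverid, Oid userid,
                                               ToyXactAction action,
                                               const char *prep_id);

typedef struct
{
    GetPrepareId_hook GetPrepareId;     /* NULL: FDW opts out of 2PC */
    HandleForeignTransaction_hook HandleForeignTransaction;
} ToyFdwRoutine;

/* Toy identifier generator; a real one must avoid future collisions. */
static int
toy_get_prepare_id(Oid serverid, Oid userid, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "px_%u_%u", serverid, userid);
}

/* Toy handler: only checks that the request is well formed. */
static bool
toy_handle_xact(Oid serverid, Oid userid, ToyXactAction action,
                const char *prep_id)
{
    bool        needs_id = (action != TOY_COMMIT && action != TOY_ABORT);

    (void) serverid;
    (void) userid;

    /* Preparing or resolving a prepared transaction needs an identifier. */
    if (needs_id && prep_id == NULL)
        return false;
    return true;
}

/* The core can treat an FDW as 2PC-capable only if both hooks are set. */
static bool
fdw_can_do_2pc(const ToyFdwRoutine *routine)
{
    return routine->GetPrepareId != NULL &&
        routine->HandleForeignTransaction != NULL;
}
```

Returning a status instead of raising an error lets the caller keep iterating over the remaining foreign connections when one of them fails.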

Achieving atomic commit
===================
If atomic_foreign_transaction is enabled, the two-phase commit protocol is used to
achieve atomic commit for transactions involving foreign servers. All the
foreign servers participating in such transaction should be capable of
participating in two-phase commit protocol. If not, the local and foreign
transactions are aborted as atomic commit can not be guaranteed.

Phase 1
-----------
Every FDW needs to register the connection while starting new transaction
on a foreign connection (RegisterXactForeignServer()). A foreign server
connection is identified by foreign server oid and the local user oid
(similar to the entry cached by postgres_fdw). While registering, the FDW also
tells whether the foreign server is capable of participating in two-phase
commit protocol. How to decide that is left entirely to the FDW. An FDW
like file_fdw may not have 2PC support at all, so all its foreign servers
do not comply with 2PC. An FDW might have all its servers 2PC compliant. An
FDW like postgres_fdw can have some of its servers compliant and some not,
depending upon server version, configuration (max_prepared_transactions =
0) etc. An FDW can decide not to register its connections at all and the
foreign servers belonging to that FDW will not be considered by the core at
all.

During pre-commit processing following steps are executed
1. GetPrepareId hook is called on each of the connections registered to get
the identifier that will be used to prepare the transaction.
2. For each connection the prepared transaction id along with the
connection information, database id and local transaction id (xid) is
recorded in the memory.
3. This is logged in XLOG. If standby is configured, it is replayed on
standby. In case of master failover a standby is able to resolve in-doubt
prepared transactions created by the master.
4. The information is written to an on-disk file in pg_fdw_xact/ directory.
This directory contains one file per prepared transaction on foreign
connection. The file is fsynced during checkpoint similar to pg_twophase
files. The file management in this directory is similar to the way, files
are managed in pg_twophase.
5. HandleForeignTransaction is called to prepare the transaction on given
connection with the identifier provided by GetPrepareId().

If the server crashes after step 5, we will remember the transaction
prepared on the foreign server and will try to abort it after recovery. If
it crashes after step 3 but before completion of step 5, we will remember a
transaction that was never prepared and try to resolve it later. This
scenario is described in the discussion of phase 2.

If any of the steps fail including the PREPARE on the foreign server
itself, the local transaction will be aborted. All the prepared
transactions on foreign servers will be aborted as described in phase 2
discussion below. Yet to be prepared transactions are rolled back by using
the same hook. If step 5 fails, the prepared foreign transaction entry is
removed from memory and disk following steps 2,3,4 in phase 2.
HandleForeignTransaction throwing an error will interfere with this, so it
is not expected to throw an error.

If the transactions are prepared on all the foreign servers successfully,
we enter phase 2 of 2PC.
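
The phase 1 control flow can be sketched as a small in-memory simulation. This is an illustration only and does not reproduce the patch: the ToyConn structure and function names are made up, steps 1 to 4 are collapsed into a single "entry_recorded" flag, the outcome of PREPARE is a pre-set boolean, and aborting is assumed to always succeed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    TOY_ACTIVE,                 /* transaction open, not yet prepared */
    TOY_PREPARED,               /* PREPARE succeeded on the foreign server */
    TOY_ROLLED_BACK             /* rolled back, prepared or not */
} ToyState;

typedef struct
{
    bool        prepare_ok;     /* simulated outcome of PREPARE (step 5) */
    bool        entry_recorded; /* stands for steps 2-4: memory, WAL, file */
    ToyState    state;
} ToyConn;

/* After a failure: abort prepared transactions, roll back the rest. */
static void
toy_abort_all(ToyConn *conns, int nconns)
{
    for (int i = 0; i < nconns; i++)
    {
        /* TOY_PREPARED would need ROLLBACK PREPARED, TOY_ACTIVE a ROLLBACK. */
        conns[i].state = TOY_ROLLED_BACK;
        conns[i].entry_recorded = false;
    }
}

/* Returns true if every registered connection was prepared. */
static bool
toy_phase1(ToyConn *conns, int nconns)
{
    assert(conns != NULL && nconns > 0);

    for (int i = 0; i < nconns; i++)
    {
        /* Record before preparing, so a crash never loses a prepared xact. */
        conns[i].entry_recorded = true;

        if (!conns[i].prepare_ok)
        {
            toy_abort_all(conns, nconns);
            return false;       /* caller aborts the local transaction */
        }
        conns[i].state = TOY_PREPARED;
    }
    return true;                /* proceed to phase 2 */
}
```

Recording the entry before issuing PREPARE is what makes the "remembered but never prepared" case possible after a crash; the opposite order would risk a prepared transaction nobody remembers.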

The local transaction is not required to be prepared per se.

Phase 2
-----------
After the local transaction has committed or aborted, the foreign
transactions prepared as part of the local transaction are committed or
aborted, respectively. Committing or aborting a prepared foreign transaction
is henceforth termed "resolving" for simplicity. The following
steps are executed while resolving a foreign prepared transaction.

1. Resolve the foreign prepared transaction on corresponding foreign server
using user mapping of local user used at the time of preparing the
transaction. This is done through hook HandleForeignTransaction().
2. If the resolution is successful, remove the prepared foreign transaction
entry from the memory
3. Log about removal of entry in XLOG. When this log is replayed during
recovery or in standby mode, it executes step 4 below.
4. Remove the corresponding file from pg_fdw_xact directory.

If the resolution is unsuccessful, leave the entry untouched. Since this
phase is carried out when no transaction exists, HandleForeignTransaction
should not throw an error and should be designed not to access database
while performing this operation.

In case server crashes after step 1 and before step 3, a resolved foreign
transaction will be considered unresolved when the local server recovers or
standby takes over the master. It will try to resolve the prepared
transaction again and should get an error from foreign server.
HandleForeignTransaction hook should treat this as normal and return true
since the prepared transaction is resolved (or rather there is nothing that
can be done). For such cases it is important that GetPrepareId returns a
transaction identifier which does not conflict with a future transaction id,
lest we resolve (maybe with the wrong outcome) a prepared transaction
which shouldn't be resolved.
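
The retry rule above (a "no such prepared transaction" error counts as success, a connection failure does not) can be sketched as follows. The result codes, structure and function names are hypothetical and the remote call is simulated by a pre-set field; the sketch only captures which outcomes allow the entry to be removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of COMMIT PREPARED / ROLLBACK PREPARED. */
typedef enum
{
    REMOTE_OK,                  /* prepared transaction resolved just now */
    REMOTE_NO_SUCH_XACT,        /* already resolved before a crash */
    REMOTE_UNREACHABLE          /* connection failure, outcome unknown */
} ToyRemoteResult;

typedef struct
{
    bool        in_use;         /* entry still present in memory and on disk */
    ToyRemoteResult next_result;    /* simulated reply of the foreign server */
} ToyFdwXactEntry;

/* What the hook would return: true when nothing is left to resolve. */
static bool
toy_hook_resolved(ToyRemoteResult result)
{
    return result == REMOTE_OK || result == REMOTE_NO_SUCH_XACT;
}

/* Phase 2 for one entry: remove it only if the hook reports success. */
static bool
toy_resolve(ToyFdwXactEntry *entry)
{
    assert(entry != NULL && entry->in_use);

    if (!toy_hook_resolved(entry->next_result))
        return false;           /* leave the entry untouched for a retry */

    entry->in_use = false;      /* stands for steps 2-4 of phase 2 */
    return true;
}
```

Treating REMOTE_NO_SUCH_XACT as success is only safe if identifiers are never reused, which is why GetPrepareId has to avoid conflicts with future transactions.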

Any crash or connection failure in phase 2 leaves the prepared transaction
in unresolved state.

Resolving unresolved foreign transactions
================================
A local/foreign server crash or connection failure after a transaction is
prepared on the foreign server leaves that transaction in unresolved state.
The patch provides a built-in function pg_fdw_resolve() to resolve those
after recovering from the failure. This built-in scans all the prepared
transactions in-memory and decides the fate (commit/rollback) based on the
fate of local transaction that prepared it on the foreign server. It does
not touch entries corresponding to the in-progress local transactions. It
then executes the same steps as phase 2 to resolve the prepared foreign
transactions. Since foreign server information is contained within a
database, the function only touches the entries corresponding to the
database from which it is invoked. A user can configure a daemon or
cron-job to execute this function frequently from various databases.
Alternatively, user can use contrib module pg_fdw_xact_resolver which does
the same using background worker mechanism. This module needs to be
installed and listed in shared_preload_libraries to start the daemon
automatically on the startup.
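
The decision made for each entry can be sketched as below. This is a simplified model under stated assumptions, not the body of pg_fdw_resolve(): the enums and the ToyEntry structure are hypothetical, and the status of the local transaction is given directly instead of being looked up from the xid.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified status of the local transaction that prepared the entry. */
typedef enum
{
    LOCAL_IN_PROGRESS,
    LOCAL_COMMITTED,
    LOCAL_ABORTED
} ToyLocalStatus;

typedef enum
{
    FATE_SKIP,                  /* leave the entry alone */
    FATE_COMMIT,                /* COMMIT PREPARED on the foreign server */
    FATE_ROLLBACK               /* ROLLBACK PREPARED on the foreign server */
} ToyFate;

typedef struct
{
    unsigned int dboid;         /* database that owns the foreign server */
    ToyLocalStatus local_status;
} ToyEntry;

/* Decide what the resolver running in database my_dboid should do. */
static ToyFate
toy_decide_fate(const ToyEntry *entry, unsigned int my_dboid)
{
    assert(entry != NULL);

    /* Foreign server definitions are per database: skip foreign entries. */
    if (entry->dboid != my_dboid)
        return FATE_SKIP;

    switch (entry->local_status)
    {
        case LOCAL_IN_PROGRESS:
            return FATE_SKIP;   /* the owning backend will resolve it */
        case LOCAL_COMMITTED:
            return FATE_COMMIT;
        case LOCAL_ABORTED:
            return FATE_ROLLBACK;
    }
    return FATE_SKIP;           /* not reached; keeps compilers quiet */
}
```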

A foreign server or user mapping corresponding to an unresolved foreign
transaction is not allowed to be dropped or altered until the foreign
transaction is resolved. This is required to retain the connection
properties which are needed to resolve the prepared transaction on the
foreign server.

Crash recovery
============
During crash recovery, the files in pg_fdw_xact/ are created or removed
when corresponding WAL records are replayed. After the redo is done
pg_fdw_xact directory is scanned for unresolved foreign prepared
transactions. The files in this directory are named as triplet (xid,
foreign server oid, user oid) to create a unique name for each file. This
scan also emits the oldest transaction id with an unresolved prepared
foreign transaction. This affects the oldest active transaction id, since the
status of this transaction id is required to decide the fate of unresolved
prepared foreign transaction.
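
As an illustration of naming a file after the (xid, foreign server oid, user oid) triplet, the sketch below builds and parses such a name. The exact format used by the patch is not stated in this mail; the three 8-digit hexadecimal fields joined by underscores are an assumption, as are the function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* 3 fields of 8 hex digits, 2 separators and the terminating NUL. */
#define TOY_FDW_XACT_NAME_LEN 27

/* Build the file name; returns false if the buffer is too small. */
static bool
toy_fdw_xact_file_name(char *buf, size_t buflen, unsigned int xid,
                       unsigned int serverid, unsigned int userid)
{
    int         n;

    assert(buf != NULL);
    n = snprintf(buf, buflen, "%08X_%08X_%08X", xid, serverid, userid);
    return n > 0 && (size_t) n < buflen;
}

/* Recover the triplet from a file name found while scanning the directory. */
static bool
toy_parse_fdw_xact_file_name(const char *name, unsigned int *xid,
                             unsigned int *serverid, unsigned int *userid)
{
    return sscanf(name, "%8X_%8X_%8X", xid, serverid, userid) == 3;
}
```

Fixed-width fields keep the names sortable and make the triplet recoverable from the name alone, which is what a post-redo directory scan needs.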

On standby during WAL replay files are just created or removed. If the
standby is required to finish recovery and take over the master,
pg_fdw_xact is scanned to read unresolved foreign prepared transactions
into the shared memory.

Preparing transaction involving foreign server/s, on local server
=================================================
While PREPARing a local transaction that involves foreign servers, the
transactions are prepared on the foreign server (as described in phase 1
above), if atomic_foreign_transaction is enabled. If the GUC is disabled,
such local transactions can not be prepared (as of this patch at least).
This also means that all the foreign servers participating in the
transaction to be prepared are required to support 2PC. While
committing/rolling back the prepared transaction the corresponding foreign
prepared transactions are committed or rolled back (as described in phase
2) resp. Any unresolved foreign transactions are resolved the same way as
above.

View for checking the current foreign prepared transactions
=============================================
A built-in function pg_fdw_xact() lists all the currently prepared foreign
transactions. This function does not list anything on standby while it's
replaying WAL, since it doesn't have any entry in-memory. A convenient view
pg_fdw_xacts lists the same with the oids converted to the names.

Handling non-atomic foreign transactions
===============================
When atomic_foreign_transaction is disabled, one-phase commit protocol is
used to commit/rollback the foreign transactions. After the local
transaction has committed/aborted, all the foreign transactions on the
registered foreign connections are committed or aborted resp. using hook
HandleForeignTransaction. Failing to commit a foreign transaction does not
affect the other foreign transactions; an attempt is still made to commit them
(if the local transaction commits).
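
The "keep going after a failure" behaviour of the one-phase path can be sketched as a simple loop. The structure and function names are hypothetical and the result of each remote COMMIT is simulated by a pre-set flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct
{
    bool        commit_ok;      /* simulated outcome of the remote COMMIT */
    bool        attempted;      /* set once the connection has been tried */
} ToyConn1PC;

/*
 * One-phase finish after the local transaction has committed: every
 * registered connection is tried, whatever happened to the others.
 * Returns the number of foreign transactions that failed to commit.
 */
static int
toy_one_phase_finish(ToyConn1PC *conns, int nconns)
{
    int         failures = 0;

    assert(conns != NULL || nconns == 0);

    for (int i = 0; i < nconns; i++)
    {
        conns[i].attempted = true;
        if (!conns[i].commit_ok)
            failures++;         /* note it, but do not stop the loop */
    }
    return failures;
}
```

A non-zero return here means the foreign servers are no longer consistent with each other, which is exactly the outcome atomic_foreign_transaction is meant to prevent.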

PITR
====
PITR may rewind the database to a point before an xid associated with an
unresolved foreign transaction. There are two approaches to deal with the
situation.
1. Just forget about the unresolved foreign transaction and remove the file
just like we do for a prepared local transaction. But then the prepared
transaction on the foreign server might be left unresolved forever and will
keep holding the resources.
2. Do not allow PITR to such point. We can not get rid of the transaction
id without getting rid of prepared foreign transaction. If we do so, we
might create conflicting files in future and might resolve the transaction
with wrong outcome.

Rest of the mail contains replies to Heikki's comments.

On Tue, Jul 7, 2015 at 2:55 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:

Hi All,

Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised in this mail thread before. Let me know, if I have left something.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to make it FDW-independent.

Wow, this is going to be a lot of new infrastructure. This is going to
need good documentation, explaining how two-phase commit works in general,
how it's implemented, how to monitor it etc. It's important to explain all
the possible failure scenarios where you're left with in-doubt
transactions, and how the DBA can resolve them.

I have included some documentation in the patch. Once we agree on the
functionality, design, I will improve the documentation further.

Since we're building a Transaction Manager into PostgreSQL, please put a
lot of thought on what kind of APIs it provides to the rest of the system.
APIs for monitoring it, configuring it, etc. And how an extension could
participate in a transaction, without necessarily being an FDW.

The patch has added all of it except extension thing. Let me know if
anything is missing.

Even today an extension can participate in a transaction by registering
transaction and subtransaction callbacks. So, as long as an extension (and
so some FDW) does things such that the failures in those do not affect the
atomicity, they can use these callbacks. However, these callbacks are not
enough to handle unresolved prepared transactions or handle connectivity
failures in the phase 2. The patch adds infrastructure to do that.

dblink might be something on your mind, but to support dblink here, it will
need too liberal a format for storing information about the prepared
transactions on other servers. This format will vary from extension to
extension, and may not be very useful as above. What we might be able to do
is expose the functions for creating files for prepared transactions and
logging about them and let extension use them. BTW, dblink_plus already
supports 2PC for dblink.

Regarding the configuration, there are many different behaviours that an
FDW could implement:

1. The FDW is read-only. Commit/abort behaviour is moot.

I can think of two flavours of read-only FDW: 1. the underlying data is
read-only 2. the FDW is read-only but the underlying data is not.
In first case, the FDW may choose not to participate in the transaction
management at all, so doesn't register the foreign connections. Still the
rest of the transaction will be atomic.

In second case however, the writes to other foreign server may depend upon
what has been read from the read-only FDW esp. in repeatable read and
higher isolation levels. So it's important that the data once read remains
intact till the transaction commits or at least is prepared, implying we
have to start a transaction on the read-only foreign server. Once the other
foreign transactions get prepared, we might be able to commit the
transaction on read-only foreign server. That optimization is not yet
implemented by my patch. But it should be possible to do in the approach
taken by the patch. Can we leave that as a future enhancement?

Does that solve your concern?

2. Transactions are not supported. All updates happen immediately
regardless of the local transaction.

An FDW can choose not to register its server and local PostgreSQL won't
know about it. Is that acceptable behaviour?

3. Transactions are supported, but two-phase commit is not. There are
three different ways we can use the remote transactions in that case:

This case is supported by using GUC atomic_foreign_transaction. The patch
implements 3.2 approach.

3.1. Commit the remote transaction before local transaction.
3.2. Commit the remote transaction after local transaction.
3.3. As long as there is only one such FDW involved, we can still do safe
two-phase commit using so-called Last Resource Optimization.

IIUC LRO, the patch uses the local transaction as last resource, which is
always present. The fate of foreign transaction is decided by the fate of
the local transaction, which is not required to be prepared per se. There
is a more relevant note later.

4. Full two-phases commit support

We don't necessarily have to support all of that, but let's keep all these
cases in mind when we design the how to configure FDWs. There's more to it
than "does it support 2PC".

A. Steps during transaction processing

------------------------------------------------

1. When an FDW connects to a foreign server and starts a transaction, it
registers that server with a boolean flag indicating whether that server is
capable of participating in a two phase commit. In the patch this is
implemented using function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if there is at least one foreign
server incapable of 2PC in a multiserver transaction. This error is thrown as
early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just updates the information. As of now,
in the patch the function is in the form of a stub.

Whether a foreign server is capable of 2PC, can be
a. FDW level decision e.g. file_fdw as of now, is incapable of 2PC but it
can build the capabilities which can be used for all the servers using
file_fdw
b. a decision based on server version type etc. thus FDW can decide that by
looking at the server properties for each server
c. a user decision where the FDW can allow a user to specify it in the form
of CREATE/ALTER SERVER option. Implemented in the patch.

For a transaction involving only a single foreign server, the current code
remains unaltered as two phase commit is not needed.

Just to be clear: you also need two-phase commit if the transaction
updated anything in the local server and in even one foreign server.

Any local transaction involving a foreign server transaction uses two-phase
commit for the foreign transaction. The local transaction is not prepared
per se. However, we should be able to optimize the case when there are no
local changes. I am not able to find a way to deduce that there was no
local change, so I have left that case unoptimized in this patch. Is there a
way to know whether a local transaction changed something locally or not?

D. Persistent and in-memory storage considerations

--------------------------------------------------------------------
I considered following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to file whenever
persistence is necessary e.g. while registering the foreign transaction in
step A.2. Requirements C.1, C.2 need some SQL interface in the form of
built-in functions or SQL commands.

The patch implements the in-memory foreign transaction table as a fixed
size array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on number of foreign
prepared transactions that need to be maintained at a time. We need
separate locks to synchronize the access to the shared memory; the patch
uses only a single LW lock. There is restriction on the length of prepared
transaction id (or prepared transaction information saved by FDW to be
general), since everything is being saved in fixed size memory. We may be
able to overcome that restriction by writing this information to separate
files (one file per foreign prepared transaction). We need to take the same
route as 2PC for C.5.

Your current approach with a file that's flushed to disk on every update
has a few problems. Firstly, it's not crash safe. Secondly, if you make it
crash-safe with fsync(), performance will suffer. You're going to need
several fsyncs per commit with 2PC anyway, there's no way around that,
but the scalable way to do that is to use the WAL so that one fsync() can
flush more than one update in one operation.

So I think you'll need to do something similar to the pg_twophase files.
WAL-log each update, and only flush the file/files to disk on a checkpoint.
Perhaps you could use the pg_twophase infrastructure for this directly, by
essentially treating every local transaction as a two-phase transaction,
with some extra flag to indicate that it's an internally-created one.

I have used approach similar to pg_twophase, but implemented it as a
separate code, as the requirements differ. But, I would like to minimize
code by unifying both, if we finalise this design. Suggestions in this
regard will be very helpful.

2. New catalog - This method takes out the need to have separate method for
C1, C5 and even C2, also the synchronization will be taken care of by row
locks, there will be no limit on the number of foreign transactions as well
as the size of foreign prepared transaction information. But big problem
with this approach is that, the changes to the catalogs are atomic with the
local transaction. If a foreign prepared transaction can not be aborted
while the local transaction is rolled back, that entry needs to be retained.
But since the local transaction is aborting the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this,
by simulating autonomous transaction through a background worker (which can
also act as a resolver). But the amount of communication and
synchronization, might affect the performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid,
ignoring MVCC. Not sure how well that would work.

I am not aware of how to do that. Do we have any precedent in the code,
something like a reference implementation which I can follow? It will help
to lift two restrictions
1. Restriction on the number of simultaneously prepared foreign
transactions.
2. Restriction on the prepared transaction identifier length.

Obviously we may be able to shed a lot of code related to file management,
lookup etc.

3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records can not be used for repeated scan as is required by the foreign
transaction resolver. Also, replaying WALs is controlled by checkpoint, so
not all WALs are replayed. If a checkpoint happens after a foreign prepared
transaction remains unresolved, corresponding WALs will never be replayed,
thus causing the foreign prepared transaction to remain unresolved forever
without a clue. So, WALs alone don't seem to be a fit here.

Right. The pg_twophase files solve that exact same issue.

There is clearly a lot of work to do here. I'm marking this as Returned
with Feedback in the commitfest, I don't think more review is going to
be helpful at this point.

That's sad. I hope people will review the patch and help improve it, even if
it's out of the commitfest.

- Heikki

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver demon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..6f587ae
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,364 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* Interval, in milliseconds, between checks for unresolved transactions */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notifications to it are not enabled. So set that flag even if the
+	 * worker itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+static void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL; 
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+			
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid); 
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/* 
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need a handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+	
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
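
The resolver launcher above handles signals with the usual flag-and-poll pattern: each handler only records that the signal arrived, and the main loop acts on the flag later. Below is a minimal standalone sketch of that pattern. It assumes plain signal() and raise() in place of pqsignal() and the process latch, which are left out, and run_until_sigterm() and on_sigterm() are hypothetical names that do not appear in the patch.

```c
#include <errno.h>
#include <signal.h>

/* Set by the handler, polled by the loop; plays the role of got_sigterm. */
static volatile sig_atomic_t sketch_got_sigterm = 0;

/* The handler does the minimum: record the signal and preserve errno. */
static void
on_sigterm(int signo)
{
	int			save_errno = errno;

	(void) signo;
	sketch_got_sigterm = 1;
	errno = save_errno;
}

/*
 * Loop until the flag is set or max_iterations is reached. To keep the sketch
 * self-contained, the loop sends itself SIGTERM during iteration raise_at
 * (pass 0 to never send it). Returns the number of iterations executed.
 */
int
run_until_sigterm(int max_iterations, int raise_at)
{
	int			iterations = 0;

	sketch_got_sigterm = 0;
	signal(SIGTERM, on_sigterm);

	while (!sketch_got_sigterm && iterations < max_iterations)
	{
		iterations++;
		if (iterations == raise_at)
			raise(SIGTERM);
	}
	return iterations;
}
```

In the patch itself the handler also calls SetLatch(MyLatch), so that a loop sleeping in WaitLatch() wakes up promptly instead of waiting for the nap time to expire.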
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..564d13a 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -8,20 +8,22 @@
  * IDENTIFICATION
  *		  contrib/postgres_fdw/connection.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 
 
 /*
  * Connection cache hash table entry
  *
  * The lookup key in this hash table is the foreign server OID plus the user
@@ -57,52 +59,57 @@ typedef struct ConnCacheEntry
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+									bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
-static void pgfdw_xact_callback(XactEvent event, void *arg);
+static void begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool is_server_twophase_compliant(ForeignServer *server);
 
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
  * server with the user's authorization.  A new connection is established
  * if we don't already have a suitable one, and a transaction is opened at
  * the right subtransaction nesting depth if we didn't do that already.
  *
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * If connection_error_ok is true, the caller handles a connection failure
+ * itself and NULL is returned; otherwise an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
  * syscaches.  For the moment, though, it's not clear that this would really
  * be useful and not mere pedantry.  We could not flush any active connections
  * mid-transaction anyway.
  */
 PGconn *
 GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt)
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
 		HASHCTL		ctl;
 
@@ -112,27 +119,23 @@ GetConnection(ForeignServer *server, UserMapping *user,
 		/* allocate ConnectionHash in the cache context */
 		ctl.hcxt = CacheMemoryContext;
 		ConnectionHash = hash_create("postgres_fdw connections", 8,
 									 &ctl,
 									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
-		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key.serverid = server->serverid;
 	key.userid = user->userid;
 
 	/*
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
@@ -152,41 +155,64 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 	server->servername);
+			return NULL;
+		}
+
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, server);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
+
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
 	return entry->conn;
 }
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+					bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
 	/*
 	 * Use PG_TRY block to ensure closing connection on error.
 	 */
 	PG_TRY();
 	{
 		const char **keywords;
 		const char **values;
@@ -227,25 +253,29 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 		{
 			char	   *connmessage;
 			int			msglen;
 
 			/* libpq typically appends a newline, strip that */
 			connmessage = pstrdup(PQerrorMessage(conn));
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"",
+					   		server->servername),
+						errdetail_internal("%s", connmessage)));
 		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
 		 * otherwise, he's piggybacking on the postgres server's user
 		 * identity. See also dblink_security_check() in contrib/dblink.
 		 */
 		if (!superuser() && !PQconnectionUsedPassword(conn))
 			ereport(ERROR,
 				  (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
@@ -362,29 +392,36 @@ do_sql_command(PGconn *conn, const char *sql)
  * Start remote transaction or subtransaction, if needed.
  *
  * Note that we always use at least REPEATABLE READ in the remote session.
  * This is so that, if a query initiates multiple scans of the same or
  * different foreign tables, we will get snapshot-consistent results from
  * those scans.  A disadvantage is that we can't provide sane emulation of
  * READ COMMITTED behavior --- it would be nice if we had some other way to
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the foreign server with the transaction manager, noting
+		 * whether it is configured to take part in two-phase commit.
+		 */
+		RegisterXactForeignServer(entry->key.serverid, entry->key.userid,
+									is_server_twophase_compliant(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
 		if (IsolationIsSerializable())
 			sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
 		else
 			sql = "START TRANSACTION ISOLATION LEVEL REPEATABLE READ";
 		do_sql_command(entry->conn, sql);
 		entry->xact_depth = 1;
 	}
@@ -506,153 +543,257 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * Crafts a prepared transaction identifier. The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It must be a string literal less than 200 bytes long.
+ * 2. It must not match the id of any other currently prepared transaction.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* Report the length, which excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresHandleForeignTransaction(Oid serveroid, Oid userid, FDWXactAction action,
+									int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
+	StringInfo		command;
+	PGresult		*res;
+	bool			result;
+	ConnCacheEntry	*entry = NULL;
+
+	/* 
+	 * If we are in a transaction for committing or aborting a prepared
+	 * transaction, it must be the resolver. The connection cache may not have
+	 * the required connection. Use GetConnection() to open one if necessary.
+	 * Otherwise, we should have a cached connection and we will need
+	 * corresponding cache entry to set statuses.
+	 */
+	if (IsTransactionState() &&
+		(action == FDW_XACT_COMMIT_PREPARED || action == FDW_XACT_ABORT_PREPARED))
+	{
+		ForeignServer	*foreign_server = GetForeignServer(serveroid); 
+		UserMapping		*user_mapping = GetUserMapping(userid, serveroid);
+
+		conn = GetConnection(foreign_server, user_mapping, false, false, true);
+	}
+	else
+	{
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key.serverid = serveroid;
+		key.userid = userid;
+
+		Assert(ConnectionHash);
+		entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+		Assert(conn);
+	}
+
+	/* If we do not have a valid connection return false. */
+	if (!conn)
+		return false;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * Craft the command. Check if the connection is in the appropriate state.
+	 * 1. If action is COMMIT or PREPARE
+	 *    The cached entry should have a running transaction
+	 * 2. If action is ABORT
+	 *    some error on the connection might have caused abort, in which
+	 *    case the connection will have a running transaction but might be
+	 *    in aborted state.
+	 * 3. If action is COMMIT or ABORT prepared transaction
+	 *    the connection should be out of transaction.
+	 *
+	 * The connection might go bad after executing the last command (even if
+	 * that executed successfully), so do not assert for CONNECTION_OK.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+
+	command = makeStringInfo();
+	switch (action)
 	{
-		PGresult   *res;
+		case FDW_XACT_PREPARE:
+			Assert(entry);
+			Assert(PQstatus(conn) != CONNECTION_OK ||
+					PQtransactionStatus(conn) == PQTRANS_INTRANS);
+
+			appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+			break;
+
+		case FDW_XACT_COMMIT:
+			Assert(entry);
+			Assert(PQstatus(conn) != CONNECTION_OK ||
+				PQtransactionStatus(conn) == PQTRANS_INTRANS);
+
+			appendStringInfo(command, "COMMIT TRANSACTION");
+			break;
+
+		case FDW_XACT_ABORT:
+			Assert(entry);
+			Assert (PQstatus(conn) != CONNECTION_OK ||
+					(PQtransactionStatus(conn) == PQTRANS_INTRANS ||
+				 	PQtransactionStatus(conn) == PQTRANS_INERROR));
+
+			appendStringInfo(command, "ABORT TRANSACTION");
+			/* Assume we might have lost track of prepared statements */
+			entry->have_error = true;
+			break;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		case FDW_XACT_COMMIT_PREPARED:
+			Assert(PQstatus(conn) != CONNECTION_OK ||
+					PQtransactionStatus(conn) == PQTRANS_IDLE);
+
+			appendStringInfo(command, "COMMIT PREPARED '%.*s'", prep_info_len,
+																prep_info);
+			break;
+
+		case FDW_XACT_ABORT_PREPARED:
+			Assert(PQstatus(conn) != CONNECTION_OK ||
+					PQtransactionStatus(conn) == PQTRANS_IDLE);
+
+			appendStringInfo(command, "ROLLBACK PREPARED '%.*s'", prep_info_len,
+																	prep_info);
+			break;
+
+		default:
+			/* Should not happen */
+			return false;
+	}
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+	/* Execute the command */
+	res = PQexec(conn, command->data);
+
+	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	{
+		int		sqlstate;
+		char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+		/*
+		 * The command failed, raise a warning. The caller i.e. foreign
+		 * transaction manager takes care of taking appropriate action. We need
+		 * the execution result for further action, so don't clear it.
+		 */
+		pgfdw_report_error(WARNING, res, conn, false, command->data);
+
+		if (diag_sqlstate)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
-
-			switch (event)
-			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
-			}
+			sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+									 diag_sqlstate[1],
+									 diag_sqlstate[2],
+									 diag_sqlstate[3],
+									 diag_sqlstate[4]);
 		}
+		else
+			sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+		/*
+		 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+		 * transaction was missing on the foreign server, it was probably
+		 * resolved by some other means. Either way, consider it resolved.
+		 */
+		if ((action == FDW_XACT_COMMIT_PREPARED ||
+				action == FDW_XACT_ABORT_PREPARED) &&
+				sqlstate == ERRCODE_UNDEFINED_OBJECT)
+			result = true;
+		else
+			result = false;
+	}
+	else
+		result = true;
+
+	PQclear(res);
 
+	/* Clean up the cache entry, if we have one */
+	if (entry)
+	{
+		/*
+		 * If there were any errors in subtransactions, and we
+		 * made prepared statements, do a DEALLOCATE ALL to make
+		 * sure we get rid of all prepared statements. This is
+		 * annoying and not terribly bulletproof, but it's
+		 * probably not worth trying harder.
+		 *
+		 * DEALLOCATE ALL only exists in 8.3 and later, so this
+		 * constrains how old a server postgres_fdw can
+		 * communicate with.  We intentionally ignore errors in
+		 * the DEALLOCATE, so that we can hobble along to some
+		 * extent with older servers (leaking prepared statements
+		 * as we go; but we don't really support update operations
+		 * pre-8.3 anyway).
+		 */
+		if (entry->have_prep_stmt && entry->have_error)
+		{
+			res = PQexec(entry->conn, "DEALLOCATE ALL");
+			PQclear(res);
+		}
+		entry->have_prep_stmt = false;
+		entry->have_error = false;
 		/* Reset state to show we're out of a transaction */
 		entry->xact_depth = 0;
-
 		/*
 		 * If the connection isn't in a good idle state, discard it to
 		 * recover. Next GetConnection will open a new connection.
 		 */
 		if (PQstatus(entry->conn) != CONNECTION_OK ||
 			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
 		{
 			elog(DEBUG3, "discarding connection %p", entry->conn);
 			PQfinish(entry->conn);
 			entry->conn = NULL;
 		}
-	}
 
-	/*
-	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
-	 */
-	xact_got_connection = false;
+		/*
+		 * Regardless of the event type, we can now mark ourselves as out of the
+		 * transaction.
+		 */
+		xact_got_connection = false;
+	
+		/* Also reset cursor numbering for next transaction */
+		cursor_number = 0;
+	}
+	else
+		/* We got the connection through GetConnection(), release it */
+		ReleaseConnection(conn);
 
-	/* Also reset cursor numbering for next transaction */
-	cursor_number = 0;
+	return result;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
 static void
 pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 					   SubTransactionId parentSubid, void *arg)
 {
 	HASH_SEQ_STATUS scan;
@@ -708,10 +849,33 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 			if (PQresultStatus(res) != PGRES_COMMAND_OK)
 				pgfdw_report_error(WARNING, res, entry->conn, true, sql);
 			else
 				PQclear(res);
 		}
 
 		/* OK, we're outta that level of subtransaction */
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * is_server_twophase_compliant
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+is_server_twophase_compliant(ForeignServer *server)
+{
+	ListCell		*lc;
+	
+	/* Check the options for two phase compliance */ 
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "twophase_compliant") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
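
postgresHandleForeignTransaction() above decides whether a failed command still counts as resolved by packing the five-character SQLSTATE into an integer and comparing it with ERRCODE_UNDEFINED_OBJECT. The following standalone sketch shows that decision under stated assumptions. The six-bit packing copies the arithmetic of PostgreSQL's MAKE_SQLSTATE macro, and 42704 and 08006 are the SQLSTATE codes for undefined_object and connection_failure. The SketchAction enum and the function names are hypothetical stand-ins for FDWXactAction and the patch's inline logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Six-bit packing, same arithmetic as MAKE_SQLSTATE in utils/elog.h. */
#define SIXBIT(ch)	(((ch) - '0') & 0x3F)
#define PACK_SQLSTATE(c1, c2, c3, c4, c5) \
	(SIXBIT(c1) + (SIXBIT(c2) << 6) + (SIXBIT(c3) << 12) + \
	 (SIXBIT(c4) << 18) + (SIXBIT(c5) << 24))

#define SKETCH_UNDEFINED_OBJECT		PACK_SQLSTATE('4', '2', '7', '0', '4')
#define SKETCH_CONNECTION_FAILURE	PACK_SQLSTATE('0', '8', '0', '0', '6')

/* Hypothetical stand-in for the patch's FDWXactAction. */
typedef enum
{
	ACT_PREPARE,
	ACT_COMMIT,
	ACT_ABORT,
	ACT_COMMIT_PREPARED,
	ACT_ABORT_PREPARED
} SketchAction;

/* A missing SQLSTATE is treated as a connection failure, as in the patch. */
int
pack_sqlstate_string(const char *s)
{
	if (s == NULL)
		return SKETCH_CONNECTION_FAILURE;
	return PACK_SQLSTATE(s[0], s[1], s[2], s[3], s[4]);
}

/*
 * A failed COMMIT PREPARED or ROLLBACK PREPARED still counts as resolved when
 * the remote server reports that the prepared transaction does not exist.
 */
bool
failure_counts_as_resolved(SketchAction action, const char *diag_sqlstate)
{
	int			sqlstate = pack_sqlstate_string(diag_sqlstate);

	return (action == ACT_COMMIT_PREPARED ||
			action == ACT_ABORT_PREPARED) &&
		sqlstate == SKETCH_UNDEFINED_OBJECT;
}
```

For example, failure_counts_as_resolved(ACT_COMMIT_PREPARED, "42704") is true, while the same SQLSTATE on a plain ACT_COMMIT, or a NULL SQLSTATE on any action, is reported as a failure.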
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1f417b3..8574942 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1,20 +1,21 @@
 -- ===================================================================
 -- create FDW objects
 -- ===================================================================
 CREATE EXTENSION postgres_fdw;
 CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
 DO $d$
     BEGIN
         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (dbname '$$||current_database()||$$',
-                     port '$$||current_setting('port')||$$'
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
             )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
 CREATE TYPE user_enum AS ENUM ('foo', 'bar', 'buz');
@@ -3634,10 +3635,383 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- tests with non-atomic foreign transactions (default)
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+(3 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+(4 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+(4 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+   6
+(5 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+   6
+(5 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+   6
+   7
+   8
+(7 rows)
+
+-- refill the local table for further tests.
+TRUNCATE TABLE lt;
+-- test with atomic foreign transactions
+SET atomic_foreign_transaction TO ON;
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_with_fdw           | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_with_fdw           | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+ database | foreign server | local user | status 
+----------+----------------+------------+--------
+(0 rows)
+
+SELECT database, gid FROM pg_prepared_xacts;
+ database | gid 
+----------+-----
+(0 rows)
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_fdw_subxact        | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_fdw_subxact        | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- The transaction in this test doesn't violate any constraints.
+TRUNCATE TABLE lt;
+ALTER SERVER loopback2 OPTIONS (SET twophase_compliant 'false'); 
+-- test with and without atomic foreign transaction
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+ERROR:  atomicity can not be guaranteed because some foreign server/s involved in transaction can not participate in two phase commit.
+SELECT * FROM lt;
+ val 
+-----
+(0 rows)
+
+SET atomic_foreign_transaction TO OFF;
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   1
+   2
+   3
+(3 rows)
+
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
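
For readers following the expected output above: with atomic_foreign_transaction on, the tests show a commit either taking effect on every server or on none, and an outright refusal when a participating server is not twophase_compliant. A minimal sketch of that decision logic follows. It is a standalone C simulation, not the patch's code. Participant, XactOutcome and atomic_commit are hypothetical names, and no real connections are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a foreign server taking part in a transaction. */
typedef struct Participant
{
	const char *name;
	bool		twophase_compliant;	/* mirrors the server option of that name */
	bool		prepare_will_fail;	/* simulates e.g. a deferred constraint error */
	bool		prepared;
	bool		committed;
} Participant;

typedef enum
{
	XACT_COMMITTED,
	XACT_ABORTED
} XactOutcome;

/*
 * Simulated atomic commit: refuse if any participant cannot do 2PC,
 * prepare everywhere, and commit only if every prepare succeeded.
 */
static XactOutcome
atomic_commit(Participant *p, size_t n)
{
	assert(p != NULL || n == 0);

	/* Atomicity cannot be guaranteed unless every participant supports 2PC. */
	for (size_t i = 0; i < n; i++)
	{
		if (!p[i].twophase_compliant)
			return XACT_ABORTED;
	}

	/* Phase 1: prepare on every participant. */
	for (size_t i = 0; i < n; i++)
	{
		if (p[i].prepare_will_fail)
		{
			/* Roll back whatever was already prepared. */
			for (size_t j = 0; j < i; j++)
				p[j].prepared = false;
			return XACT_ABORTED;
		}
		p[i].prepared = true;
	}

	/* Phase 2: every prepare succeeded, so commit everywhere. */
	for (size_t i = 0; i < n; i++)
	{
		p[i].prepared = false;
		p[i].committed = true;
	}
	return XACT_COMMITTED;
}
```

Calling atomic_commit() on two participants where the second has prepare_will_fail set returns XACT_ABORTED with neither marked committed, which corresponds to the "can not prepare transaction on foreign server loopback2" cases in the expected output.
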
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..ed956ab 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -98,21 +98,22 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname),
 					 errhint("Valid options in this context are: %s",
 							 buf.data)));
 		}
 
 		/*
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "twophase_compliant") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
 		}
 		else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 				 strcmp(def->defname, "fdw_tuple_cost") == 0)
 		{
 			/* these must have a non-negative numeric value */
 			double		val;
 			char	   *endp;
@@ -146,20 +147,22 @@ InitPgFdwOptions(void)
 		{"column_name", AttributeRelationId, false},
 		/* use_remote_estimate is available on both server and table */
 		{"use_remote_estimate", ForeignServerRelationId, false},
 		{"use_remote_estimate", ForeignTableRelationId, false},
 		/* cost factors */
 		{"fdw_startup_cost", ForeignServerRelationId, false},
 		{"fdw_tuple_cost", ForeignServerRelationId, false},
 		/* updatable is available on both server and table */
 		{"updatable", ForeignServerRelationId, false},
 		{"updatable", ForeignTableRelationId, false},
+		/* 2PC compatibility */
+		{"twophase_compliant", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
 	/* Prevent redundant initialization. */
 	if (postgres_fdw_options)
 		return;
 
 	/*
 	 * Get list of valid libpq options.
 	 *
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 6da01e1..9d1df25 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -9,20 +9,22 @@
  *		  contrib/postgres_fdw/postgres_fdw.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -362,20 +364,24 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for EXPLAIN */
 	routine->ExplainForeignScan = postgresExplainForeignScan;
 	routine->ExplainForeignModify = postgresExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	routine->AnalyzeForeignTable = postgresAnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->HandleForeignTransaction = postgresHandleForeignTransaction;
+
 	PG_RETURN_POINTER(routine);
 }
 
 /*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
  * We should consider the effect of all baserestrictinfo clauses here, but
  * not any join clauses.
  */
@@ -918,21 +924,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	/* Get info about foreign table. */
 	fsstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(fsstate->rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(server, user, false);
+	fsstate->conn = GetConnection(server, user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
 
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
 	fsstate->retrieved_attrs = (List *) list_nth(fsplan->fdw_private,
 											   FdwScanPrivateRetrievedAttrs);
@@ -1316,21 +1322,21 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	 */
 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
 
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(server, user, true);
+	fmstate->conn = GetConnection(server, user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
 	fmstate->query = strVal(list_nth(fdw_private,
 									 FdwModifyPrivateUpdateSql));
 	fmstate->target_attrs = (List *) list_nth(fdw_private,
 											  FdwModifyPrivateTargetAttnums);
 	fmstate->has_returning = intVal(list_nth(fdw_private,
 											 FdwModifyPrivateHasReturning));
 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
@@ -1766,21 +1772,21 @@ estimate_path_cost_size(PlannerInfo *root,
 		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
 						 &retrieved_attrs);
 		if (fpinfo->remote_conds)
 			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
 							  true, NULL);
 		if (remote_join_conds)
 			appendWhereClause(&sql, root, baserel, remote_join_conds,
 							  (fpinfo->remote_conds == NIL), NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		conn = GetConnection(fpinfo->server, fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
 
 		retrieved_rows = rows;
 
 		/* Factor in the selectivity of the locally-checked quals */
 		local_sel = clauselist_selectivity(root,
 										   local_join_conds,
 										   baserel->relid,
@@ -2330,21 +2336,21 @@ postgresAnalyzeForeignTable(Relation relation,
 	 * it's probably not worth redefining that API at this point.
 	 */
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
 	 */
 	initStringInfo(&sql);
 	deparseAnalyzeSizeSql(&sql, relation);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
 	{
@@ -2422,21 +2428,21 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 											ALLOCSET_SMALL_INITSIZE,
 											ALLOCSET_SMALL_MAXSIZE);
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
 	 */
 	cursor_number = GetCursorNumber(conn);
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
 
 	/* In what follows, do not risk leaking any PGresults. */
@@ -2623,21 +2629,21 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname)));
 	}
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(server, mapping, false);
+	conn = GetConnection(server, mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
 		import_collate = false;
 
 	/* Create workspace for strings */
 	initStringInfo(&buf);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
@@ -2987,10 +2993,11 @@ static void
 conversion_error_callback(void *arg)
 {
 	ConversionLocation *errpos = (ConversionLocation *) arg;
 	TupleDesc	tupdesc = RelationGetDescr(errpos->rel);
 
 	if (errpos->cur_attno > 0 && errpos->cur_attno <= tupdesc->natts)
 		errcontext("column \"%s\" of foreign table \"%s\"",
 				   NameStr(tupdesc->attrs[errpos->cur_attno - 1]->attname),
 				   RelationGetRelationName(errpos->rel));
 }
+
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..69684bd 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -10,30 +10,32 @@
  *
  *-------------------------------------------------------------------------
  */
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
 #include "utils/relcache.h"
+#include "access/fdw_xact.h"
 
 #include "libpq-fe.h"
 
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt);
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
 						 const char **keywords,
 						 const char **values);
@@ -67,12 +69,15 @@ extern void deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 				 List *targetAttrs, List *returningList,
 				 List **retrieved_attrs);
 extern void deparseDeleteSql(StringInfo buf, PlannerInfo *root,
 				 Index rtindex, Relation rel,
 				 List *returningList,
 				 List **retrieved_attrs);
 extern void deparseAnalyzeSizeSql(StringInfo buf, Relation rel);
 extern void deparseAnalyzeSql(StringInfo buf, Relation rel,
 				  List **retrieved_attrs);
 extern void deparseStringLiteral(StringInfo buf, const char *val);
+extern bool postgresHandleForeignTransaction(Oid serveroid, Oid userid, FDWXactAction action,
+											int prep_xact_len, char *prep_xact_name);
+extern char *postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
 
 #endif   /* POSTGRES_FDW_H */
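
To illustrate the kind of value a GetPrepareId callback such as postgresGetPrepareId() might hand back, here is a rough sketch of building a prepared transaction identifier that is unlikely to collide and that lets a DBA trace it back to its origin. The helper name make_prep_xact_id, the origin_tag argument and the "px_..." layout are assumptions made for illustration, not the format used by the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: build an identifier of the form
 * "px_<origin tag in hex>_<server oid>_<user oid>".
 * Returns a malloc'd string (caller frees) and stores its length in *len,
 * or returns NULL on failure.
 */
static char *
make_prep_xact_id(unsigned int origin_tag, unsigned int serveroid,
				  unsigned int userid, int *len)
{
	char		buf[64];	/* worst case is 33 characters plus the NUL */
	int			n;
	char	   *id;

	assert(len != NULL);

	n = snprintf(buf, sizeof(buf), "px_%X_%u_%u",
				 origin_tag, serveroid, userid);
	if (n < 0 || (size_t) n >= sizeof(buf))
		return NULL;

	id = malloc((size_t) n + 1);
	if (id == NULL)
		return NULL;

	memcpy(id, buf, (size_t) n + 1);	/* copies the terminating NUL too */
	*len = n;
	return id;
}
```

In this sketch origin_tag stands for whatever the caller uses to make the identifier distinctive, for example a random number or a value identifying the originating server. The result stays well under the 200-byte limit PostgreSQL places on prepared transaction identifiers.
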
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index fcdd92e..f01e32b 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2,21 +2,22 @@
 -- create FDW objects
 -- ===================================================================
 
 CREATE EXTENSION postgres_fdw;
 
 CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
 DO $d$
     BEGIN
         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (dbname '$$||current_database()||$$',
-                     port '$$||current_setting('port')||$$'
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
             )$$;
     END;
 $d$;
 
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -827,10 +828,244 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- tests with non-atomic foreign transactions (default)
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- refill the local table for further tests.
+TRUNCATE TABLE lt;
+
+-- test with atomic foreign transactions
+SET atomic_foreign_transaction TO ON;
+
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+SELECT database, gid FROM pg_prepared_xacts;
+
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- The transaction in this test doesn't violate any constraints.
+TRUNCATE TABLE lt;
+
+ALTER SERVER loopback2 OPTIONS (SET twophase_compliant 'false'); 
+-- test with and without atomic foreign transaction
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+SET atomic_foreign_transaction TO OFF;
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1da7dfb..288b8a1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1432,20 +1432,62 @@ include_dir 'conf.d'
        </para>
 
        <para>
         When running a standby server, you must set this parameter to the
         same or higher value than on the master server. Otherwise, queries
         will not be allowed in the standby server.
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-fdw-transactions" xreflabel="max_fdw_transactions">
+      <term><varname>max_fdw_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_fdw_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions cannot be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled, a transaction involving foreign servers is
+       guaranteed to commit either all or none of its changes to those servers.
+       The parameter can be set at any time during the session. The value in effect
+       at the time the transaction commits is used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>work_mem</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
         Specifies the amount of memory to be used by internal sort operations
         and hash tables before writing to temporary disk files. The value
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 2361577..9b931e4 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -918,20 +918,85 @@ ImportForeignSchema (ImportForeignSchemaStmt *stmt, Oid serverOid);
      useful to test whether a given foreign-table name will pass the filter.
     </para>
 
     <para>
      If the FDW does not support importing table definitions, the
      <function>ImportForeignSchema</> pointer can be set to <literal>NULL</>.
     </para>
 
    </sect2>
 
+   <sect2 id="fdw-callbacks-transactions">
+    <title>FDW Routines For Transaction Management</title>
+
+    <para>
+<programlisting>
+char *
+GetPrepareId (Oid serverOid, Oid userid, int *prep_info_len);
+</programlisting>
+
+     Get the prepared transaction identifier for the given foreign server and
+     user. This function is called when executing <xref linkend="sql-commit">, if
+     <literal>atomic_foreign_transaction</> is enabled. It should return a
+     valid transaction identifier that will be used to prepare the transaction
+     on the foreign server. <parameter>prep_info_len</> should be set
+     to the length of this identifier. The identifier should not be longer
+     than 256 bytes and should not conflict with existing
+     identifiers on the foreign server. It should also be unique enough that it
+     is never reused for a future transaction. A transaction may be
+     considered unresolved on <productname>PostgreSQL</> while it is in fact
+     resolved on the foreign server. The foreign transaction resolver then keeps
+     trying to resolve it until it finds out that it has already been resolved.
+     If the identifier is the same as that of a future transaction,
+     that future transaction may get resolved
+     erroneously.
+    </para>
+
+    <para>
+     If a foreign server whose Foreign Data Wrapper has a <literal>NULL</>
+      <function>GetPrepareId</> pointer participates in a transaction
+      with <literal>atomic_foreign_transaction</> enabled, the transaction
+      is aborted.
+    </para>
+
+    <para>
+<programlisting>
+bool
+HandleForeignTransaction (Oid serverOid, Oid userid, FDWXactAction action,
+                            int prep_id_len, char *prep_id);
+</programlisting>
+
+     Function to end a transaction on the given foreign server with given user.
+     This function is called when executing <xref linkend="sql-commit"> or
+     <xref linkend="sql-rollback">. The function should complete a transaction
+     according to the <parameter>action</> specified. The function should
+     return TRUE on successful completion of transaction and FALSE otherwise.
+     It should not throw an error in case of failure to complete the transaction.
+    </para>
+
+    <para>
+    When <parameter>action</> is FDW_XACT_COMMIT or FDW_XACT_ABORT, the function
+    should respectively commit or roll back the running transaction. When <parameter>action</>
+    is FDW_XACT_PREPARE, the running transaction should be prepared with the
+    identifier given by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    When <parameter>action</> is FDW_XACT_COMMIT_PREPARED or FDW_XACT_ABORT_PREPARED,
+    the function should respectively commit or roll back the transaction identified
+    by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    </para>
+
+    <para>
+    This function is usually called at the end of the transaction, when
+    access to the database may not be possible. Trying to access catalogs
+    in this function may cause an error to be thrown and can affect other foreign
+    data wrappers.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
     <title>Foreign Data Wrapper Helper Functions</title>
 
     <para>
      Several helper functions are exported from the core server so that
      authors of foreign data wrappers can get easy access to attributes of
      FDW-related objects, such as FDW options.
      To use any of these functions, you need to include the header file
@@ -1308,11 +1373,93 @@ GetForeignServerByName(const char *name, bool missing_ok);
     <para>
      See <filename>src/include/nodes/lockoptions.h</>, the comments
      for <type>RowMarkType</> and <type>PlanRowMark</>
      in <filename>src/include/nodes/plannodes.h</>, and the comments for
      <type>ExecRowMark</> in <filename>src/include/nodes/execnodes.h</> for
      additional information.
     </para>
 
   </sect1>
 
+   <sect1 id="fdw-transactions">
+    <title>Transaction Manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and
+    write data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts
+    a transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it is required to register the foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to it,
+    using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant);
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    the two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may use
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    an atomic distributed transaction. All the registered foreign servers must
+    support the two-phase commit protocol. In this protocol the commit
+    is processed in two phases: a prepare phase and a commit phase. In the prepare
+    phase, <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any of the foreign servers fails to prepare its transaction, the prepare
+    phase fails. In the commit phase, all the prepared transactions are committed
+    if the prepare phase has succeeded, or rolled back if the prepare phase failed
+    to prepare transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_PREPARE.
+    </para>
+    
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity cannot be
+    guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on each of the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
  </chapter>
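The two-phase flow described in the fdwhandler.sgml text above can be sketched in plain C. This is only an illustration under stated assumptions, not code from the patch: Participant, Action, handle() and two_phase_commit() are hypothetical names, and handle() merely simulates what an FDW's HandleForeignTransaction callback might report (false on failure, no error thrown).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the FDW_XACT_* actions named in the docs. */
typedef enum
{
	ACT_PREPARE,
	ACT_COMMIT_PREPARED,
	ACT_ABORT_PREPARED,
	ACT_ABORT
} Action;

/* Hypothetical participant: one foreign server taking part in the transaction. */
typedef struct
{
	bool	prepare_ok;		/* simulated outcome of PREPARE on this server */
	bool	prepared;		/* set once PREPARE has succeeded */
	Action	last_action;	/* last action delivered to this server */
} Participant;

/* Simulated handler: reports failure by returning false, never by raising. */
static bool
handle(Participant *p, Action action)
{
	p->last_action = action;
	if (action == ACT_PREPARE)
	{
		p->prepared = p->prepare_ok;
		return p->prepare_ok;
	}
	return true;
}

/*
 * Phase 1 prepares every participant and stops at the first failure.
 * Phase 2 commits all prepared transactions if phase 1 succeeded; otherwise
 * it rolls back the prepared ones and aborts the ones never prepared.
 * Returns true if the distributed transaction committed.
 */
static bool
two_phase_commit(Participant *parts, size_t n)
{
	bool	all_prepared = true;
	size_t	i;

	for (i = 0; i < n; i++)
	{
		if (!handle(&parts[i], ACT_PREPARE))
		{
			all_prepared = false;
			break;
		}
	}

	for (i = 0; i < n; i++)
	{
		if (all_prepared)
			handle(&parts[i], ACT_COMMIT_PREPARED);
		else if (parts[i].prepared)
			handle(&parts[i], ACT_ABORT_PREPARED);
		else
			handle(&parts[i], ACT_ABORT);
	}
	return all_prepared;
}
```

The sketch leaves out what the patch treats as essential, namely persisting each identifier before preparing and retrying unresolved transactions later. It shows only the ordering of actions across participants.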
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..8c1afcf 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -1,16 +1,16 @@
 #
 # Makefile for the rmgr descriptor routines
 #
 # src/backend/access/rmgrdesc/Makefile
 #
 
 subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o gindesc.o \
+	   gistdesc.o hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
 	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..0f0c899
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module describes the WAL records for the foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 4f29136..ad07038 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -104,28 +104,29 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 			if (entry->val == xlrec.wal_level)
 			{
 				wal_level_str = entry->name;
 				break;
 			}
 		}
 
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamps=%s",
+						 "track_commit_timestamps=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
 		memcpy(&fpw, rec, sizeof(bool));
 		appendStringInfoString(buf, fpw ? "true" : "false");
 	}
 	else if (info == XLOG_END_OF_RECOVERY)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 94455b2..51b2efd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -8,16 +8,17 @@
 #
 #-------------------------------------------------------------------------
 
 subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o commit_ts.o multixact.o parallel.o rmgr.o slru.o subtrans.o \
 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o \
+	fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
 # ensure that version checks in xlog.c get recompiled when catversion.h changes
 xlog.o: xlog.c $(top_srcdir)/src/include/catalog/catversion.h
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..291900c
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,1884 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module manages the transactions involving foreign servers. 
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign server/s.
+ *
+ * When an foreign data wrapper starts transaction on a foreign server, it is
+ * required to register the foreign server and user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by oid of foreign server and user.
+ *
+ * At the time of commit, GUC atomic_foreign_transactions controls whether we
+ * attempt to commit foreign transactions atomically along with the local
+ * transaction or not. 
+ *
+ * If atomic_foreign_transaction is enabled, the two-phase commit protocol is
+ * used to achieve atomicity. In the first phase transactions are prepared on all
+ * participating foreign servers. If the first phase succeeds, foreign servers are
+ * requested to commit their respective prepared transactions. If the first phase
+ * does not succeed because of any failure, the foreign servers are asked to
+ * roll back their respective prepared transactions, or to abort the transactions
+ * if they are not prepared. A network failure or server crash after preparing a
+ * foreign transaction leaves that prepared transaction unresolved. During the
+ * first phase, before actually preparing the transactions, enough information is
+ * persisted to the disk and logs in order to resolve such transactions. The
+ * first phase is executed during pre-commit processing. The second phase is
+ * executed during post-commit processing or when processing an aborted
+ * transaction.
+ *
+ * If atomic_foreign_transaction is disabled, the one-phase commit protocol is
+ * used. During post-commit processing or when processing an aborted transaction,
+ * foreign servers are respectively asked to commit or roll back their
+ * transactions. A failure in executing this step on any single foreign server
+ * does not affect the other foreign servers. Thus if the local transaction
+ * commits, it is not guaranteed that all the participating foreign servers
+ * commit their respective transactions.
+ */
+
+/* GUC which controls atomicity of transactions involving foreign servers */
+bool	atomic_foreign_xact = false;
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */ 
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	GetPrepareId_function		prepare_id_provider;
+	HandleForeignTransaction_function	fdw_xact_handler;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction are two phase commit compliant.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_compliant)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine 		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_compliant;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u with user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(fdw_conn->serverid); 
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->HandleForeignTransaction)
+			elog(ERROR, "no foreign transaction handler provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_compliant && !fdw_routine->GetPrepareId)
+		elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it before hand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->fdw_xact_handler = fdw_routine->HandleForeignTransaction;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->fdw_xact = NULL;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Restore the previous memory context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */ 
+	Oid				serveroid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	FDWXactAction	fdw_xact_action;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */ 
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 hexadecimal digits each of xid,
+ * foreign server oid and user oid, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serveroid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serveroid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact, HandleForeignTransaction_function fdw_xact_handler);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactAction fdw_xact_action);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid,
+								Oid userid, int fdw_xact_info_len,
+								char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static HandleForeignTransaction_function get_fdw_xact_handler(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serveroid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid,
+								bool giveWarning);
+
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * GUC variable; a change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */ 
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * If the user has requested atomicity for transactions involving foreign
+ * servers (atomic_foreign_transaction GUC enabled), this
+ * function executes the first phase of 2PC.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*lcell;
+	/* Quick exit, if there are no foreign servers involved */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/* Non-atomic foreign transactions, nothing to do here */
+	if (!atomic_foreign_xact)
+		return;
+	/*
+	 * If user expects the transactions involving foreign servers to be atomic,
+	 * make sure that every foreign server can participate in two phase commit.
+	 * Checking this GUC value at the time of COMMIT allows user to set it
+	 * during the transaction depending upon the foreign servers involved.
+	 */
+	if (atomic_foreign_xact && !TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("atomicity can not be guaranteed because some foreign server/s involved in transaction can not participate in two phase commit.")));
+
+	/* 
+	 * Prepare foreign transactions.
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+															fdw_conn->userid,
+															&fdw_xact_info_len);
+		
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (thereby relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared a transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 * 
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+											fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info);
+		/*
+		 * Between the register_fdw_xact call above and the time this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it
+		 * will send an ABORT message to the foreign server. If the foreign server
+		 * has not prepared the transaction, the message will succeed. If the
+		 * foreign server has prepared the transaction, it will throw an error,
+		 * which we will ignore, and the prepared foreign transaction will be
+		 * resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->fdw_xact_handler(fdw_conn->serverid, fdw_conn->userid,
+											FDW_XACT_PREPARE,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("could not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact; 
+	}
+}
+
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serveroid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARE);
+	/*
+	 * Prepare to write the entry to a file. Also add an xlog entry. The
+	 * contents of the xlog record are the same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid; 
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serveroid = fdw_xact->serveroid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->fdw_xact_action = fdw_xact->fdw_xact_action;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serveroid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */ 
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m",
+						path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.  Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.  The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */ 
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact 
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id,
+					FDWXactAction fdw_xact_action)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serveroid == serveroid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serveroid %u, userid %u found",
+						xid, serveroid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_fdw_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serveroid = serveroid;
+	fdw_xact->userid = userid;
+	fdw_xact->fdw_xact_action = fdw_xact_action; 
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */ 
+			fdw_remove_xlog.serveroid = fdw_xact->serveroid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serveroid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of locked foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between those two steps, we will be left
+	 * with an entry locked by this backend, which gets unlocked only at the
+	 * server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/* 
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *    according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *    succeeds, remove them from foreign transaction entries, otherwise unlock
+ *    them.
+ */ 
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+	/*
+	 * For non-atomic foreign transactions, commit/abort the transactions on
+	 * foreign server/s. For atomic foreign transactions, commit/abort the
+	 * prepared transactions.
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			/*
+			 * We should be preparing foreign transactions only if we are in
+			 * atomic foreign transaction mode.
+			 */
+			Assert(atomic_foreign_xact);
+			fdw_xact->fdw_xact_action = (is_commit ?
+											FDW_XACT_COMMIT_PREPARED :
+											FDW_XACT_ABORT_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->fdw_xact_handler))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * If we are in atomic foreign transaction mode and committing the
+			 * transaction, we should have prepared all the transactions. We end
+			 * up here in atomic foreign transaction mode only when the
+			 * transaction is aborted while the foreign transactions are being
+			 * prepared.
+			 */
+			Assert(!atomic_foreign_xact || !is_commit);
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->fdw_xact_handler(fdw_conn->serverid, fdw_conn->userid,
+										is_commit ? FDW_XACT_COMMIT : FDW_XACT_ABORT,
+										0, NULL))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* Nothing to do here, if there are no foreign servers involved. */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	if (!atomic_foreign_xact)
+	{
+		ereport(ERROR,
+			(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+			 errmsg("atomic_foreign_xact must be enabled to prepare a transaction involving foreign servers")));
+	}
+
+	/* Prepare transactions on participating foreign servers. */
+	PreCommit_FDWXacts();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/* Free the list of registered foreign servers */
+	MyFDWConnections = NIL;
+}
+
+/* 
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	int	cnt_xact;
+	/*
+	 * Since COMMIT/ROLLBACK PREPARED is not run in a transaction, we should not
+	 * have any foreign transaction entries already locked by us.
+	 */
+	Assert(!MyLockedFDWXacts);
+
+	/* Lock entries with given transaction ids and mark their fate */
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+		FDWXactAction	action = isCommit ? FDW_XACT_COMMIT_PREPARED :
+											FDW_XACT_ABORT_PREPARED;
+
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+		
+		/*
+		 * If foreign transaction resolver is running, it might lock entries
+		 * to check whether they can be resolved. Skip such entries. The
+		 * resolver will resolve them at a later point of time.
+		 */
+		if (fdw_xact->locking_backend != InvalidBackendId)
+			continue;
+
+		if (fdw_xact->local_xid == xid)
+		{
+			MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+			fdw_xact->locking_backend = MyBackendId;
+			fdw_xact->fdw_xact_action = action;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* Try resolving the foreign transactions */
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = linitial(MyLockedFDWXacts);
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact, get_fdw_xact_handler(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_fdw_xact_handler
+ * Look up the transaction handler routine of the foreign data wrapper that
+ * owns the server of the given foreign transaction entry. Raises an error if
+ * the FDW does not provide one.
+ */
+static HandleForeignTransaction_function
+get_fdw_xact_handler(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serveroid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->HandleForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+				fdw->fdwname);
+	return fdw_routine->HandleForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool 
+resolve_fdw_xact(FDWXact fdw_xact,
+				HandleForeignTransaction_function fdw_xact_handler)
+{
+	bool				resolved;
+
+	Assert(fdw_xact->fdw_xact_action == FDW_XACT_COMMIT_PREPARED ||
+			fdw_xact->fdw_xact_action == FDW_XACT_ABORT_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serveroid, fdw_xact->userid,
+								fdw_xact->fdw_xact_action,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+	
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Return true if there exists at least one foreign transaction entry with given
+ * criteria.
+ * The criteria are defined by the arguments that carry valid values for their
+ * respective datatypes. The table below explains the same
+ * xid     | serveroid  | userid  | search for
+ * invalid | invalid    | invalid | any entry
+ * valid   | invalid    | invalid | entry with given xid
+ * invalid | valid      | invalid | entry with given serveroid
+ * invalid | invalid    | valid   | entry with given userid
+ * valid   | valid      | invalid | entry with given xid and serveroid
+ * valid   | invalid    | valid   | entry with given xid and userid
+ * invalid | valid      | valid   | entry with given serveroid and userid
+ * valid   | valid      | valid   | entry with given xid, serveroid and userid
+ *
+ * If there exists an entry with the given criteria, the function returns true,
+ * otherwise false. When the criteria are void (all arguments invalid), any
+ * entry matches, so the function returns true if at least one entry exists.
+ *
+ * dbid augments server and user oids and should be valid if either of them is
+ * valid. Its validity is not checked.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dboid, Oid serveroid, Oid userid)
+{
+	int	cnt;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/*
+		 * If xid is valid and it doesn't match entry's xid, the entry doesn't
+		 * match the criteria.
+		 */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* Now match the database, if either serveroid or userid is valid */
+		if ((OidIsValid(serveroid) || OidIsValid(userid)) &&
+			fdw_xact->dboid != dboid)
+			entry_matches = false;
+		
+		/*
+		 * If serveroid is valid and doesn't match entry's server, the entry
+		 * doesn't match.
+		 */
+		if (OidIsValid(serveroid) && serveroid != fdw_xact->serveroid)
+			entry_matches = false;
+
+		/*
+		 * If userid is valid and doesn't match entry's user, the entry
+		 * doesn't match.
+		 */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			LWLockRelease(FDWXactLock);
+			return true;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	/* We did not find any matching entry */
+	return false;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * Returns the OIDs of the databases containing unresolved foreign transactions.
+ * The function is used by the pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+	
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+		
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	  		*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+		
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serveroid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+	
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m",
+							path)));
+		}
+	
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file: %m")));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+		
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serveroid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Syncs the foreign transaction files created between two checkpoints.
+ * The foreign transaction entries, and hence the corresponding files, are
+ * expected to be very short-lived. By executing this function at the end, we
+ * might have fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a lightweight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serveroids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serveroids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serveroids[nxacts] = fdw_xact->serveroid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serveroids[cnt], userids[cnt]);
+			
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serveroids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serveroids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns the number of foreign prepared transactions and stores a palloc'd
+ * array of them in *fdw_xacts, for the user-level function pg_fdw_xact.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't
+ * want them.
+ *
+ * The returned array is palloc'd; it is set to NULL if there are no entries.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_action)
+		{
+			case FDW_XACT_PREPARE:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMIT_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORT_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* XXX: should the identifier really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * A user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function returns the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		int	cnt_xact;
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+	
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * The transaction entries are created at the time of commit and this
+		 * function unlocks the entries it locked before it finishes, so we
+		 * shouldn't see any locked entries at the beginning.
+		 */
+		Assert(!MyLockedFDWXacts);
+	
+		/* The list and its members may be required at the end of the transaction */
+		oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+	
+		/* Take exclusive lock, since we will be locking the entries */ 
+		LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+		/* Scan all the foreign transactions */
+		for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+		{
+			FDWXact	fdw_xact;
+	
+			fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+			
+			/*
+			 * Pick up entries whose foreign servers are part of the database where
+			 * this function was called. We can get information about only such
+			 * foreign servers. Lock the entries that are picked, so that other
+			 * backends will stay away from them.
+			 */
+			if (fdw_xact->dboid == MyDatabaseId &&
+					fdw_xact->locking_backend == InvalidBackendId)
+			{
+				/*
+				 * Remember the locked entries so that we can release them if we
+				 * need to exit abruptly.
+				 */
+				MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+				fdw_xact->locking_backend = MyBackendId;
+			}
+		}
+		LWLockRelease(FDWXactLock);
+		MemoryContextSwitchTo(oldcontext);
+	
+		/* Work to resolve the resolvable entries */
+		while (MyLockedFDWXacts)
+		{
+			FDWXact	fdw_xact = linitial(MyLockedFDWXacts);
+	
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_action == FDW_XACT_COMMIT_PREPARED ||
+					fdw_xact->fdw_xact_action == FDW_XACT_ABORT_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign transaction;
+				 * nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_action = FDW_XACT_COMMIT_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_action = FDW_XACT_ABORT_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, yet no backend
+				 * is running it. It probably crashed before the actual abort or
+				 * commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_action = FDW_XACT_ABORT_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+	
+	
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_fdw_xact_handler(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_action = FDW_XACT_RESOLVED;
+		}
+	}
+	
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_action)
+		{
+			case FDW_XACT_PREPARE:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMIT_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORT_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serveroid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has an invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serveroid != serveroid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		/* fd was already closed after the read above; do not close it again. */
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		pfree(buf);
+		return NULL;
+	}
+	
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *    conflicting file names.
+ *
+ * While it's possible to avoid 1 by recording the transaction status in the
+ * file, 2 looks unavoidable.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function throws error, unlike PrescanPreparedTransactions. While it
+ * suffices to remove a two-phase file, it is not the case with
+ * foreign prepared transaction file, which merely points to a prepared
+ * transaction on the foreign server. Removing such file would make PostgreSQL
+ * forget the prepared foreign transaction, which might remain unresolved
+ * forever. 
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			/*
+			 * Throw error if the transaction which prepared this foreign
+			 * transaction is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+				ereport(ERROR,
+					  (errmsg("a future foreign prepared transaction file \"%s\" found",
+							  clde->d_name)));
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+	
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serveroid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serveroid, userid)));
+
+			/* Add this entry into the table of foreign transactions. */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										serveroid, userid,
+										fdw_xact_file_data->fdw_xact_id_len,
+										fdw_xact_file_data->fdw_xact_id,
+										fdw_xact_file_data->fdw_xact_action);
+			/* Add some valid LSN */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */	
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+	
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..cdbc583 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -7,20 +7,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/brin_xlog.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 177d1e1..5c9aec7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -34,20 +34,21 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <time.h>
 #include <unistd.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
@@ -1469,20 +1470,26 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 		RelationCacheInitFilePostInvalidate();
 
 	/* And now do the callbacks */
 	if (isCommit)
 		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
 	else
 		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
 	/*
 	 * And now we can clean up our mess.
 	 */
 	RemoveTwoPhaseFile(xid, true);
 
 	RemoveGXact(gxact);
 	MyLockedGxact = NULL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b53d95f..aaa0edc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -14,20 +14,21 @@
  *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
 #include <time.h>
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -179,20 +180,24 @@ typedef struct TransactionStateData
 	TransactionId *childXids;	/* subcommitted child XIDs, in XID order */
 	int			nChildXids;		/* # of subcommitted child XIDs */
 	int			maxChildXids;	/* allocated size of childXids[] */
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
 	bool		startedInRecovery;		/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in
+										   the transaction; only valid for top level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
 
 /*
  * CurrentTransactionState always points to the current transaction state
  * block.  It will point to TopTransactionStateData when not in a
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
@@ -1884,20 +1889,23 @@ StartTransaction(void)
 	/* SecurityRestrictionContext should never be set outside a transaction */
 	Assert(s->prevSecContext == 0);
 
 	/*
 	 * initialize other subsystems for new transaction
 	 */
 	AtStart_GUC();
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
 	 */
 	s->state = TRANS_INPROGRESS;
 
 	ShowTransactionState("StartTransaction");
 }
 
 
@@ -1940,20 +1948,23 @@ CommitTransaction(void)
 
 		/*
 		 * Close open portals (converting holdable ones into static portals).
 		 * If there weren't any, we are done ... otherwise loop back to check
 		 * if they queued deferred triggers.  Lather, rinse, repeat.
 		 */
 		if (!PreCommit_Portals(false))
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
 	/*
 	 * The remaining actions cannot call any user-defined code, so it's safe
 	 * to start shutting down within-transaction services.  But note that most
 	 * of this stuff could still throw an error, which would switch us into
 	 * the transaction-abort path.
 	 */
 
@@ -2099,20 +2110,21 @@ CommitTransaction(void)
 	AtEOXact_GUC(true, 1);
 	AtEOXact_SPI(true);
 	AtEOXact_on_commit_actions(true);
 	AtEOXact_Namespace(true, is_parallel_worker);
 	AtEOXact_SMgr();
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
 	ResourceOwnerDelete(TopTransactionResourceOwner);
 	s->curTransactionOwner = NULL;
 	CurTransactionResourceOwner = NULL;
 	TopTransactionResourceOwner = NULL;
 
 	AtCommit_Memory();
 
@@ -2283,20 +2295,21 @@ PrepareTransaction(void)
 	 * before or after releasing the transaction's locks.
 	 */
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
 	AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
 	 *
 	 * We have to record transaction prepares even if we didn't make any
 	 * updates, because the transaction manager might get confused if we lose
 	 * a global transaction.
 	 */
 	EndPrepare(gxact);
 
@@ -2565,20 +2578,21 @@ AbortTransaction(void)
 
 		AtEOXact_GUC(false, 1);
 		AtEOXact_SPI(false);
 		AtEOXact_on_commit_actions(false);
 		AtEOXact_Namespace(false, is_parallel_worker);
 		AtEOXact_SMgr();
 		AtEOXact_Files();
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
 	RESUME_INTERRUPTS();
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e37ad3..cb52884 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -16,20 +16,21 @@
 
 #include <ctype.h>
 #include <time.h>
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -4841,20 +4842,21 @@ BootStrapXLOG(void)
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
 	WriteControlFile();
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5847,20 +5849,26 @@ CheckRequiredParameterValues(void)
 									 ControlFile->max_worker_processes);
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
 		RecoveryRequiresBoolParameter("track_commit_timestamp",
 									  track_commit_timestamp,
 									  ControlFile->track_commit_timestamp);
+		RecoveryRequiresBoolParameter("track_commit_timestamp",
+									  track_commit_timestamp,
+									  ControlFile->track_commit_timestamp);
+		RecoveryRequiresIntParameter("max_fdw_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
 StartupXLOG(void)
 {
 	XLogCtlInsert *Insert;
@@ -6522,21 +6530,24 @@ StartupXLOG(void)
 		{
 			TransactionId *xids;
 			int			nxids;
 
 			ereport(DEBUG1,
 					(errmsg("initializing for hot standby")));
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
 
 			/* Tell procarray about the range of xids it has to deal with */
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
 			 * Startup commit log, commit timestamp and subtrans only.
 			 * MultiXact has already been started up and other SLRUs are not
@@ -7122,20 +7133,21 @@ StartupXLOG(void)
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
 	XLogCtl->LogwrtResult = LogwrtResult;
 
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7319,20 +7331,26 @@ StartupXLOG(void)
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
 	TrimCLOG();
 	TrimMultiXact();
 
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
 	if (readFile >= 0)
 	{
 		close(readFile);
@@ -8593,20 +8611,25 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointMultiXact();
 	CheckPointPredicate();
 	CheckPointRelationMap();
 	CheckPointReplicationSlots();
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
  * Save a checkpoint for recovery restart if appropriate
  *
  * This function is called each time a checkpoint record is read from XLOG.
  * It must determine whether the checkpoint represents a safe restartpoint or
  * not.  If so, the checkpoint record is stashed in shared memory so that
  * CreateRestartPoint can consult it.  (Note that the latter function is
  * executed by the checkpointer, while this one will be executed by the
@@ -9018,56 +9041,59 @@ XLogRestorePoint(const char *rpName)
  */
 static void
 XLogReportParameters(void)
 {
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
 		 * if archiving is not enabled, as you can't start archive recovery
 		 * with wal_level=minimal anyway. We don't really care about the
 		 * values in pg_control either if wal_level=minimal, but seems better
 		 * to keep them up-to-date to avoid confusion.
 		 */
 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
 		{
 			xl_parameter_change xlrec;
 			XLogRecPtr	recptr;
 
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
 			recptr = XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE);
 			XLogFlush(recptr);
 		}
 
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
 
 /*
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
  * Note: this function assumes there is no other process running
  * concurrently that could update it.
@@ -9241,20 +9267,21 @@ xlog_redo(XLogReaderState *record)
 		 */
 		if (standbyState >= STANDBY_INITIALIZED)
 		{
 			TransactionId *xids;
 			int			nxids;
 			TransactionId oldestActiveXID;
 			TransactionId latestCompletedXid;
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
 			 * down server, with only prepared transactions still alive. We're
 			 * never overflowed at this point because all subxids are listed
 			 * with their parent prepared transactions.
 			 */
 			running.xcnt = nxids;
 			running.subxcnt = 0;
 			running.subxid_overflow = false;
@@ -9430,20 +9457,21 @@ xlog_redo(XLogReaderState *record)
 		/* Update our copy of the parameters in pg_control */
 		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
 
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
 		 * recover back up to this point before allowing hot standby again.
 		 * This is particularly important if wal_level was set to 'archive'
 		 * before, and is now 'hot_standby', to ensure you don't run queries
 		 * against the WAL preceding the wal_level change. Same applies to
 		 * decreasing max_* settings.
 		 */
 		minRecoveryPoint = ControlFile->minRecoveryPoint;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 95d6c14..3100f50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -11,20 +11,21 @@
  *	  src/backend/bootstrap/bootstrap.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <unistd.h>
 #include <signal.h>
 
 #include "access/htup_details.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e82a53a..a35b41e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -234,20 +234,29 @@ CREATE VIEW pg_available_extension_versions AS
            LEFT JOIN pg_extension AS X
              ON E.name = X.extname AND E.version = X.extversion;
 
 CREATE VIEW pg_prepared_xacts AS
     SELECT P.transaction, P.gid, P.prepared,
            U.rolname AS owner, D.datname AS database
     FROM pg_prepared_xact() AS P
          LEFT JOIN pg_authid U ON P.ownerid = U.oid
          LEFT JOIN pg_database D ON P.dbid = D.oid;
 
+CREATE VIEW pg_fdw_xacts AS
+	SELECT P.transaction, D.datname AS database, S.srvname AS "foreign server",
+			U.rolname AS "local user", P.status,
+			P.identifier AS "foreign transaction identifier" 
+	FROM pg_fdw_xact() AS P
+		LEFT JOIN pg_authid U ON P.userid = U.oid
+		LEFT JOIN pg_database D ON P.dbid = D.oid
+		LEFT JOIN pg_foreign_server S ON P.serverid = S.oid;
+
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
 	CASE WHEN rel.relkind = 'r' THEN 'table'::text
 		 WHEN rel.relkind = 'v' THEN 'view'::text
 		 WHEN rel.relkind = 'm' THEN 'materialized view'::text
 		 WHEN rel.relkind = 'S' THEN 'sequence'::text
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 3b85c2c..9f8efa9 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -6,20 +6,21 @@
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  *
  *
  * IDENTIFICATION
  *	  src/backend/commands/foreigncmds.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
@@ -998,20 +999,29 @@ AlterForeignServer(AlterForeignServerStmt *stmt)
 	srvId = HeapTupleGetOid(tp);
 	srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
 
 	/*
 	 * Only owner or a superuser can ALTER a SERVER.
 	 */
 	if (!pg_foreign_server_ownercheck(srvId, GetUserId()))
 		aclcheck_error(ACLCHECK_NOT_OWNER, ACL_KIND_FOREIGN_SERVER,
 					   stmt->servername);
 
+	/*
+	 * If a foreign transaction is prepared on this foreign server, altering
+	 * the server's properties might leave the prepared transaction dangling.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+							stmt->servername)));
+
 	memset(repl_val, 0, sizeof(repl_val));
 	memset(repl_null, false, sizeof(repl_null));
 	memset(repl_repl, false, sizeof(repl_repl));
 
 	if (stmt->has_version)
 	{
 		/*
 		 * Change the server VERSION string.
 		 */
 		if (stmt->version)
@@ -1079,20 +1089,34 @@ RemoveForeignServerById(Oid srvId)
 	HeapTuple	tp;
 	Relation	rel;
 
 	rel = heap_open(ForeignServerRelationId, RowExclusiveLock);
 
 	tp = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(srvId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
 
 	heap_close(rel, RowExclusiveLock);
 }
 
 
 /*
  * Common routine to check permission for user-mapping-related DDL
@@ -1251,20 +1275,31 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
 
 	umId = GetSysCacheOid2(USERMAPPINGUSERSERVER,
 						   ObjectIdGetDatum(useId),
 						   ObjectIdGetDatum(srv->serverid));
 	if (!OidIsValid(umId))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("user mapping \"%s\" does not exist for the server",
 						MappingUserName(useId))));
 
+
+	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * changing properties of the user mapping might leave a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid, useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
 	user_mapping_ddl_aclcheck(useId, srv->serverid, stmt->servername);
 
 	tp = SearchSysCacheCopy1(USERMAPPINGOID, ObjectIdGetDatum(umId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for user mapping %u", umId);
 
 	memset(repl_val, 0, sizeof(repl_val));
 	memset(repl_null, false, sizeof(repl_null));
 	memset(repl_repl, false, sizeof(repl_repl));
@@ -1377,20 +1412,31 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 		/* IF EXISTS specified, just note it */
 		ereport(NOTICE,
 		(errmsg("user mapping \"%s\" does not exist for the server, skipping",
 				MappingUserName(useId))));
 		return InvalidOid;
 	}
 
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might leave a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
 	object.objectId = umId;
 	object.objectSubId = 0;
 
 	performDeletion(&object, DROP_CASCADE, 0);
 
 	return umId;
 }
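The four DDL checks above all call fdw_xact_exists() with InvalidTransactionId and, in the server-level cases, InvalidOid for the user. A minimal standalone sketch of that lookup follows, assuming (this is inferred from the call sites, not confirmed by this part of the patch) that an invalid value acts as a wildcard for its field. DemoFdwXact and demo_fdw_xact_exists are hypothetical names, and the typedefs are stand-ins so the sketch compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs and constants (assumption). */
typedef uint32_t Oid;
typedef uint32_t TransactionId;
#define InvalidOid				((Oid) 0)
#define InvalidTransactionId	((TransactionId) 0)

/* Hypothetical, reduced in-memory entry for one prepared foreign transaction. */
typedef struct
{
	TransactionId xid;
	Oid			dboid;
	Oid			serverid;
	Oid			userid;
} DemoFdwXact;

/*
 * Return true if any entry matches every criterion that is valid.
 * An invalid xid or OID is treated as "match anything" (assumed semantics).
 */
bool
demo_fdw_xact_exists(const DemoFdwXact *entries, size_t n,
					 TransactionId xid, Oid dboid, Oid serverid, Oid userid)
{
	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		const DemoFdwXact *e = &entries[i];

		if (xid != InvalidTransactionId && e->xid != xid)
			continue;
		if (dboid != InvalidOid && e->dboid != dboid)
			continue;
		if (serverid != InvalidOid && e->serverid != serverid)
			continue;
		if (userid != InvalidOid && e->userid != userid)
			continue;
		return true;			/* all specified fields matched */
	}
	return false;
}
```

With these semantics, the ALTER/DROP SERVER checks ask "any prepared foreign transaction on this server in this database, for any user", while the user mapping checks narrow that to one user.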
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1757b4d..c2e0019 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -86,20 +86,21 @@
 #ifdef USE_BONJOUR
 #include <dns_sd.h>
 #endif
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
 #include <pthread.h>
 #endif
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "lib/ilist.h"
 #include "libpq/auth.h"
 #include "libpq/ip.h"
 #include "libpq/libpq.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
@@ -2438,21 +2439,20 @@ pmdie(SIGNAL_ARGS)
 				SignalUnconnectedWorkers(SIGTERM);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
 				/* and the bgwriter too */
 				if (BgWriterPID != 0)
 					signal_child(BgWriterPID, SIGTERM);
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
-
 				/*
 				 * If we're in recovery, we can't kill the startup process
 				 * right away, because at present doing so does not release
 				 * its locks.  We might want to change this in a future
 				 * release.  For the time being, the PM_WAIT_READONLY state
 				 * indicates that we're waiting for the regular (read only)
 				 * backends to die off; once they do, we'll kill the startup
 				 * and walreceiver processes.
 				 */
 				pmState = (pmState == PM_RUN) ?
@@ -5664,20 +5664,21 @@ PostmasterMarkPIDForWorkerNotify(int pid)
 
 	dlist_foreach(iter, &BackendList)
 	{
 		bp = dlist_container(Backend, elem, iter.cur);
 		if (bp->pid == pid)
 		{
 			bp->bgworker_notify = true;
 			return true;
 		}
 	}
+
 	return false;
 }
 
 #ifdef EXEC_BACKEND
 
 /*
  * The following need to be available to the save/restore_backend_variables
  * functions.  They are marked NON_EXEC_STATIC in their home modules.
  */
 extern slock_t *ShmemLock;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c629da3..6fdd818 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -127,20 +127,21 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_MULTIXACT_ID:
 		case RM_RELMAP_ID:
 		case RM_BTREE_ID:
 		case RM_HASH_ID:
 		case RM_GIN_ID:
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
 	}
 }
 
 /*
  * Handle rmgr XLOG_ID records for DecodeRecordIntoReorderBuffer().
  */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a790e5b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -14,20 +14,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/fdw_xact.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
@@ -132,20 +133,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcSignalShmemSize());
 		size = add_size(size, CheckpointerShmemSize());
 		size = add_size(size, AutoVacuumShmemSize());
 		size = add_size(size, ReplicationSlotsShmemSize());
 		size = add_size(size, ReplicationOriginShmemSize());
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
 
 		/* freeze the addin request size and include it */
 		addin_request_allowed = false;
 		size = add_size(size, total_addin_request);
 
 		/* might as well round it off to a multiple of a typical page size */
 		size = add_size(size, 8192 - (size % 8192));
@@ -243,20 +245,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
 	/*
 	 * Alloc the win32 shared backend array
 	 */
 	if (!IsUnderPostmaster)
 		ShmemBackendArrayAllocation();
 #endif
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 230c5cc..37d2638 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -20,20 +20,21 @@
 #include <float.h>
 #include <math.h>
 #include <limits.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/vacuum.h"
 #include "commands/variable.h"
 #include "commands/trigger.h"
@@ -1607,20 +1608,31 @@ static struct config_bool ConfigureNamesBool[] =
 		{"data_checksums", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows whether data checksums are turned on for this cluster."),
 			NULL,
 			GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
 		},
 		&data_checksums,
 		false,
 		NULL, NULL, NULL
 	},
 
+	{
+		{"atomic_foreign_transaction", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Transactions involving foreign servers are atomic."),
+			NULL,
+			GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&atomic_foreign_xact,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
 	}
 };
 
 
 static struct config_int ConfigureNamesInt[] =
 {
 	{
@@ -1995,20 +2007,33 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"max_prepared_transactions", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the maximum number of simultaneously prepared transactions."),
 			NULL
 		},
 		&max_prepared_xacts,
 		0, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_fdw_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("Sets the minimum OID of tables for tracking locks."),
 			gettext_noop("Is used to avoid output on system tables."),
 			GUC_NOT_IN_SAMPLE
 		},
 		&Trace_lock_oidmin,
 		FirstNormalObjectId, 0, INT_MAX,
 		NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..5f4cb1b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -117,20 +117,26 @@
 					# (change requires restart)
 #huge_pages = try			# on, off, or try
 					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
 # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
 # per transaction slot, plus lock space (see max_locks_per_transaction).
 # It is not advisable to set max_prepared_transactions nonzero unless you
 # actively intend to use prepared transactions.
+#max_fdw_transactions = 0		# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_fdw_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_fdw_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
 					# supported by the operating system:
 					#   posix
 					#   sysv
 					#   windows
 					#   mmap
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index feeff9e..47ecf1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,31 +192,32 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
 	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
 };
 
 
 /* path to 'initdb' binary directory */
 static char bin_path[MAXPGPATH];
 static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			  const char *token, const char *replacement);
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d8cfe5e..00aad71 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -324,12 +324,14 @@ main(int argc, char *argv[])
 	printf(_("Size of a large-object chunk:         %u\n"),
 		   ControlFile.loblksize);
 	printf(_("Date/time type storage:               %s\n"),
 		   (ControlFile.enableIntTimes ? _("64-bit integers") : _("floating-point numbers")));
 	printf(_("Float4 argument passing:              %s\n"),
 		   (ControlFile.float4ByVal ? _("by value") : _("by reference")));
 	printf(_("Float8 argument passing:              %s\n"),
 		   (ControlFile.float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile.data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile.max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 6ffe795..2a37993 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -581,20 +581,21 @@ GuessControlValues(void)
 	ControlFile.unloggedLSN = 1;
 
 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
 	ControlFile.relseg_size = RELSEG_SIZE;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
 	ControlFile.indexMaxKeys = INDEX_MAX_KEYS;
@@ -797,20 +798,21 @@ RewriteControlFile(void)
 	 * Force the defaults for max_* settings. The values don't really matter
 	 * as long as wal_level='minimal'; the postmaster will reset these fields
 	 * anyway at startup.
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
 	ControlFile.xlog_seg_size = XLogSegSize;
 
 	/* Contents are protected with a CRC */
 	INIT_CRC32C(ControlFile.crc);
 	COMP_CRC32C(ControlFile.crc,
 				(char *) &ControlFile,
 				offsetof(ControlFileData, crc));
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 2205d6e..b9f3d84 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -14,20 +14,21 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/fdw_xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
 	{ name, desc, identify},
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..29ba637
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,89 @@
+/*
+ * fdw_xact.h 
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H 
+#define FDW_XACT_H 
+
+#include "storage/backendid.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "foreign/fdwapi.h"
+
+/* This enum doubles as status and action */
+typedef enum
+{
+	/* Commit/Abort the foreign transaction using one-phase commit */
+	FDW_XACT_COMMIT,
+	FDW_XACT_ABORT,
+	/* Two-phase actions/statuses */
+	FDW_XACT_PREPARE,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMIT_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORT_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactAction;
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serveroid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	FDWXactAction	fdw_xact_action;	/* The state of the foreign transaction */
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serveroid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+extern bool	atomic_foreign_xact;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serveroid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
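Since FDWXactOnDiskData ends in a flexible array member, each record's size depends on the length of the foreign transaction identifier. A minimal standalone sketch of sizing and filling such a record is below. It is not the patch's code: DemoFDWXactOnDisk, demo_fdw_xact_size and demo_fdw_xact_make are hypothetical names, the typedefs are stand-ins, the status field is reduced to an int, and malloc is used where the backend would presumably use palloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-ins for the PostgreSQL typedefs (assumption). */
typedef uint32_t Oid;
typedef uint32_t TransactionId;

/* Simplified copy of FDWXactOnDiskData for illustration only. */
typedef struct
{
	Oid			dboid;
	TransactionId local_xid;
	Oid			serveroid;
	Oid			userid;
	int			fdw_xact_action;	/* stands in for FDWXactAction */
	uint32_t	fdw_xact_id_len;	/* bytes stored in fdw_xact_id */
	char		fdw_xact_id[];		/* must remain the last member */
} DemoFDWXactOnDisk;

/* Bytes needed for a record whose identifier is id_len bytes long. */
size_t
demo_fdw_xact_size(uint32_t id_len)
{
	return offsetof(DemoFDWXactOnDisk, fdw_xact_id) + id_len;
}

/* Allocate and fill a record; returns NULL if allocation fails. Caller frees. */
DemoFDWXactOnDisk *
demo_fdw_xact_make(Oid dboid, TransactionId xid, Oid serveroid, Oid userid,
				   const char *id, uint32_t id_len)
{
	DemoFDWXactOnDisk *rec;

	assert(id != NULL || id_len == 0);

	rec = malloc(demo_fdw_xact_size(id_len));
	if (rec == NULL)
		return NULL;

	rec->dboid = dboid;
	rec->local_xid = xid;
	rec->serveroid = serveroid;
	rec->userid = userid;
	rec->fdw_xact_action = 0;
	rec->fdw_xact_id_len = id_len;
	if (id_len > 0)
		memcpy(rec->fdw_xact_id, id, id_len);	/* not NUL-terminated */
	return rec;
}
```

Using offsetof() rather than sizeof() avoids counting trailing padding, which matters if the same size is used when writing the record to a file or to WAL.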
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c083216..7272c33 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -37,11 +37,12 @@ PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify,
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
 PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index fbf9324..dc8768f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -199,20 +199,21 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 
 /*
  * Information logged when we detect a change in one of the parameters
  * important for Hot Standby.
  */
 typedef struct xl_parameter_change
 {
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
 	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
 typedef struct xl_restore_point
 {
 	TimestampTz rp_time;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..d168c32 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -173,20 +173,21 @@ typedef struct ControlFileData
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
 	 * or hot standby.
 	 */
 	int			wal_level;
 	bool		wal_log_hints;
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
 	 * explicitly, since the pg_control version will surely look wrong to a
 	 * machine of different endianness, but we do need to worry about MAXALIGN
 	 * and floating-point format.  (Note: storage layout nominally also
 	 * depends on SHORTALIGN and INTALIGN, but in practice these are the same
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6b3d194..9c31fde 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5242,20 +5242,24 @@ DESCR("fractional rank of hypothetical row");
 DATA(insert OID = 3989 ( percent_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_percent_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3990 ( cume_dist			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 701 "2276" "{2276}" "{v}" _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("cumulative distribution of hypothetical row");
 DATA(insert OID = 3991 ( cume_dist_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_cume_dist_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 20 "2276" "{2276}" "{v}" _null_ _null_ _null_	aggregate_dummy _null_ _null_ _null_ ));
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_	hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4066 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4083 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3584 ( binary_upgrade_set_next_array_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_array_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3585 ( binary_upgrade_set_next_toast_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_toast_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3586 ( binary_upgrade_set_next_heap_pg_class_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_heap_pg_class_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..d2a8344 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -5,20 +5,21 @@
  *
  * Copyright (c) 2010-2015, PostgreSQL Global Development Group
  *
  * src/include/foreign/fdwapi.h
  *
  *-------------------------------------------------------------------------
  */
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/xact.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
 /* To avoid including explain.h here, reference ExplainState thus: */
 struct ExplainState;
 
 
 /*
  * Callback function signatures --- see fdwhandler.sgml for more info.
  */
@@ -110,20 +111,27 @@ typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 											   HeapTuple *rows, int targrows,
 												  double *totalrows,
 												  double *totaldeadrows);
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 												 AcquireSampleRowsFunc *func,
 													BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
+typedef bool (*HandleForeignTransaction_function) (Oid serverOid, Oid userid,
+														XactEvent on_event, 
+														int prep_info_len,
+														char *prep_info);
+typedef char *(*GetPrepareId_function) (Oid serverOid, Oid userid,
+														int *prep_info_len);
+
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
  * planner and executor.
  *
  * More function pointers are likely to be added in the future.  Therefore
  * it's recommended that the handler initialize the struct with
  * makeNode(FdwRoutine) so that all fields are set to NULL.  This will
  * ensure that no fields are accidentally left undefined.
@@ -165,20 +173,24 @@ typedef struct FdwRoutine
 
 	/* Support functions for EXPLAIN */
 	ExplainForeignScan_function ExplainForeignScan;
 	ExplainForeignModify_function ExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	AnalyzeForeignTable_function AnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
+
+	/* Support functions for foreign transactions */
+	HandleForeignTransaction_function	HandleForeignTransaction;
+	GetPrepareId_function				GetPrepareId;
 } FdwRoutine;
 
 
 /* Functions in foreign/foreign.c */
 extern FdwRoutine *GetFdwRoutine(Oid fdwhandler);
 extern Oid	GetForeignServerIdByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineByServerId(Oid serverid);
 extern FdwRoutine *GetFdwRoutineByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineForRelation(Relation relation, bool makecopy);
 extern bool IsImportableForeignTable(const char *tablename,
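To make the GetPrepareId callback concrete, here is a rough standalone sketch of how an FDW might build an identifier that is unlikely to collide and still tells a DBA which server and user it belongs to. Everything in it is an assumption for illustration: the "px_<nonce>_<server>_<user>" format, the name demo_get_prepare_id, and the explicit nonce argument (the real callback takes only serverOid, userid and prep_info_len, so an implementation would have to obtain its random or per-transaction part some other way).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the PostgreSQL typedef (assumption). */
typedef uint32_t Oid;

/* Hypothetical identifier layout: px_<8 hex digits>_<server OID>_<user OID> */
#define DEMO_PREP_ID_FMT "px_%08" PRIX32 "_%" PRIu32 "_%" PRIu32

/*
 * Build a prepared-transaction identifier. Returns a malloc'd, NUL-terminated
 * string and stores its length (without the terminator) in *prep_info_len,
 * or returns NULL on failure. Caller frees.
 */
char *
demo_get_prepare_id(Oid serverOid, Oid userid, uint32_t nonce,
					int *prep_info_len)
{
	int			needed;
	char	   *id;

	assert(prep_info_len != NULL);

	/* First pass only measures the formatted length. */
	needed = snprintf(NULL, 0, DEMO_PREP_ID_FMT, nonce, serverOid, userid);
	if (needed < 0)
		return NULL;

	id = malloc((size_t) needed + 1);
	if (id == NULL)
		return NULL;

	snprintf(id, (size_t) needed + 1, DEMO_PREP_ID_FMT,
			 nonce, serverOid, userid);
	*prep_info_len = (int) strlen(id);
	return id;
}
```

The returned buffer and length would correspond to the prep_info and prep_info_len arguments later handed to HandleForeignTransaction.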
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cff3b99..d03b119 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,22 +128,23 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define OldSerXidLock				(&MainLWLockArray[31].lock)
 #define SyncRepLock					(&MainLWLockArray[32].lock)
 #define BackgroundWorkerLock		(&MainLWLockArray[33].lock)
 #define DynamicSharedMemoryControlLock		(&MainLWLockArray[34].lock)
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
 #define ReplicationOriginLock		(&MainLWLockArray[40].lock)
+#define FDWXactLock					(&MainLWLockArray[41].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		41
+#define NUM_INDIVIDUAL_LWLOCKS		42
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
  * here, but we need them to figure out offsets within MainLWLockArray, and
  * having this file include lock.h or bufmgr.h would be backwards.
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index e807a2e..283917c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -209,25 +209,26 @@ typedef struct PROC_HDR
 } PROC_HDR;
 
 extern PROC_HDR *ProcGlobal;
 
 extern PGPROC *PreparedXactProcs;
 
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
 extern int	DeadlockTimeout;
 extern int	StatementTimeout;
 extern int	LockTimeout;
 extern bool log_lock_waits;
 
 
 /*
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 51f25a2..9fd8c79 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1242,11 +1242,14 @@ extern Datum pg_available_extensions(PG_FUNCTION_ARGS);
 extern Datum pg_available_extension_versions(PG_FUNCTION_ARGS);
 extern Datum pg_extension_update_paths(PG_FUNCTION_ARGS);
 extern Datum pg_extension_config_dump(PG_FUNCTION_ARGS);
 
 /* commands/prepare.c */
 extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 60c1f40..fddd351 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1305,20 +1305,30 @@ pg_available_extensions| SELECT e.name,
     e.comment
    FROM (pg_available_extensions() e(name, default_version, comment)
      LEFT JOIN pg_extension x ON ((e.name = x.extname)));
 pg_cursors| SELECT c.name,
     c.statement,
     c.is_holdable,
     c.is_binary,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT p.transaction,
+    d.datname AS database,
+    s.srvname AS "foreign server",
+    u.rolname AS "local user",
+    p.status,
+    p.identifier AS "foreign transaction identifier"
+   FROM (((pg_fdw_xact() p(dbid, transaction, serverid, userid, status, identifier)
+     LEFT JOIN pg_authid u ON ((p.userid = u.oid)))
+     LEFT JOIN pg_database d ON ((p.dbid = d.oid)))
+     LEFT JOIN pg_foreign_server s ON ((p.serverid = s.oid)));
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
     a.name,
     a.setting
    FROM pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting);
 pg_group| SELECT pg_authid.rolname AS groname,
     pg_authid.oid AS grosysid,
     ARRAY( SELECT pg_auth_members.member
            FROM pg_auth_members
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index a267894..10dc7f2 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2224,37 +2224,40 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		if (system(buf))
 		{
 			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
 			exit(2);
 		}
 
 		/*
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_fdw_transactions to enable testing of atomic
+		 * foreign transactions. (Note: to reduce the probability of unexpected
+		 * shmmax failures, don't set max_prepared_transactions or
+		 * max_fdw_transactions any higher than actually needed by the
+		 * corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
 		if (pg_conf == NULL)
 		{
 			fprintf(stderr, _("\n%s: could not open \"%s\" for adding extra config: %s\n"), progname, buf, strerror(errno));
 			exit(2);
 		}
 		fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
 		fputs("log_autovacuum_min_duration = 0\n", pg_conf);
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_fdw_transactions = 2\n", pg_conf);
 
 		if (temp_config != NULL)
 		{
 			FILE	   *extra_conf;
 			char		line_buf[1024];
 
 			extra_conf = fopen(temp_config, "r");
 			if (extra_conf == NULL)
 			{
 				fprintf(stderr, _("\n%s: could not open \"%s\" to read extra config: %s\n"), progname, temp_config, strerror(errno));

#30

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Ashutosh Bapat (#29)

Re: Transactions involving multiple postgres foreign servers

Overall, you seem to have made some significant progress on the design
since the last version of this patch. There's probably a lot left to
do, but the design seems more mature now. I haven't read the code,
but here are some comments based on the email.

On Thu, Jul 9, 2015 at 6:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The patch introduces a GUC atomic_foreign_transaction, which when ON ensures
atomic commit for foreign transactions, otherwise not. The value of this GUC
at the time of committing or preparing a local transaction is used. This
gives applications the flexibility to choose the behaviour as late in the
transaction as possible. This GUC has no effect if there are no foreign
servers involved in the transaction.

Hmm. I'm not crazy about that name, but I don't have a better idea either.

One thing about this design is that it makes atomicity a property of
the transaction rather than the server. That is, any given
transaction is either atomic with respect to all servers or atomic
with respect to none. You could also design this the other way: each
server is either configured to do atomic commit, or not. When a
transaction is committed, it prepares on those servers which are
configured for it, and then commits the others. So then you can have
a "partially atomic" transaction where, for example, you transfer
money from account A to account B (using one or more FDW connections
that support atomic commit) and also use twitter_fdw to tweet about it
(using an FDW connection that does NOT support atomic commit). The
tweet will survive even if the local commit fails, but that's OK. You
could even do this at table granularity: we'll prepare the transaction
if at least one foreign table involved in the transaction has
atomic_commit = true.

In some sense I think this might be a nicer design, because suppose
you connect to a foreign server and mostly just log stuff but
occasionally do important things there. In your design, you can do
this, but you'll need to make sure atomic_foreign_transaction is set
for the correct set of transactions. But in what I'm proposing here
we might be able to derive the correct value mostly automatically.
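
The per-server alternative described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `atomic_commit` flag, the `ForeignServerXact` struct and both function names are hypothetical and do not come from the patch, and the state changes stand in for the remote commands (PREPARE TRANSACTION, COMMIT, COMMIT PREPARED) that real code would send.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping; none of these names come from the patch. */
typedef enum
{
	FS_RUNNING,
	FS_PREPARED,
	FS_COMMITTED
} ForeignXactState;

typedef struct
{
	const char *server_name;
	bool		atomic_commit;	/* assumed per-server option */
	ForeignXactState state;
} ForeignServerXact;

/*
 * Before the local commit: prepare on servers configured for atomic commit
 * and commit the others outright.  Returns the number of servers left in
 * the prepared state.
 */
size_t
pre_commit_foreign(ForeignServerXact *servers, size_t nservers)
{
	size_t		nprepared = 0;
	size_t		i;

	assert(servers != NULL || nservers == 0);
	for (i = 0; i < nservers; i++)
	{
		if (servers[i].atomic_commit)
		{
			servers[i].state = FS_PREPARED;		/* PREPARE TRANSACTION */
			nprepared++;
		}
		else
			servers[i].state = FS_COMMITTED;	/* plain COMMIT */
	}
	return nprepared;
}

/* After the local commit succeeded: commit everything that was prepared. */
void
post_commit_foreign(ForeignServerXact *servers, size_t nservers)
{
	size_t		i;

	assert(servers != NULL || nservers == 0);
	for (i = 0; i < nservers; i++)
	{
		if (servers[i].state == FS_PREPARED)
			servers[i].state = FS_COMMITTED;	/* COMMIT PREPARED */
	}
}
```

In this sketch a non-atomic server (the twitter_fdw case) is already committed before the local commit is attempted, which is why its effects would survive a local commit failure.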

We should consider other possible designs as well; the choices we make
here may have a significant impact on usability.

Another GUC max_fdw_transactions sets the maximum number of transactions
that can be simultaneously prepared on all the foreign servers. This limits
the memory required for remembering the prepared foreign transactions.

How about max_prepared_foreign_transactions?

Two new FDW hooks are introduced for transaction management.
1. GetPrepareId: to get the prepared transaction identifier for a given
foreign server connection. An FDW which doesn't want to support this feature
can keep this hook undefined (NULL). When defined the hook should return a
unique identifier for the transaction prepared on the foreign server. The
identifier should be unique enough not to conflict with currently prepared
or future transactions. This point will be clear when discussing phase 2 of
2PC.

2. HandleForeignTransaction: to end a transaction in specified way. The hook
should be able to prepare/commit/rollback current running transaction on
given connection or commit/rollback a previously prepared transaction. This
is described in detail while describing phase two of two-phase commit. The
function is required to return a boolean status of whether the requested
operation was successful or not. The function or its minions should not
raise any error on failure so as not to interfere with the distributed
transaction processing. This point will be clarified more in the description
below.
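
As a rough illustration of the first hook, here is a function with the same shape as the GetPrepareId_function typedef in the patch (server OID, user OID, out-parameter for the length). The identifier format and the per-process counter are assumptions made for this sketch only; the patch's own scheme is described in its comments, and a real implementation would need something stronger than a counter to stay unique across backends and restarts.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;		/* stand-in for the PostgreSQL type */

/*
 * Build an identifier that embeds the server and user OIDs so that a DBA
 * can tell where a prepared transaction came from.  The "px_" prefix and
 * the counter are hypothetical.  Returns a malloc'd string, or NULL on
 * allocation failure.
 */
char *
example_get_prepare_id(Oid serverOid, Oid userid, int *prep_info_len)
{
	static unsigned long counter = 0;
	char		buf[64];
	char	   *id;
	int			len;

	assert(prep_info_len != NULL);

	len = snprintf(buf, sizeof(buf), "px_%lu_%u_%u",
				   ++counter, serverOid, userid);
	if (len < 0 || (size_t) len >= sizeof(buf))
		return NULL;

	id = malloc((size_t) len + 1);
	if (id == NULL)
		return NULL;

	memcpy(id, buf, (size_t) len + 1);	/* include the terminating NUL */
	*prep_info_len = len;
	return id;
}
```

The caller owns the returned string and is expected to free it.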

HandleForeignTransaction is not very descriptive, and I think you're
jamming together things that ought to be separated. Let's have a
PrepareForeignTransaction and a ResolvePreparedForeignTransaction.

A foreign server or user mapping corresponding to an unresolved foreign
transaction is not allowed to be dropped or altered until the foreign
transaction is resolved. This is required to retain the connection
properties which are needed to resolve the prepared transaction on the
foreign server.

I agree with not letting it be dropped, but I think not letting it be
altered is a serious mistake. Suppose the foreign server dies in a
fire, its replica is promoted, and we need to re-point the master at
the replica's hostname or IP.

Handling non-atomic foreign transactions
===============================
When atomic_foreign_transaction is disabled, one-phase commit protocol is
used to commit/rollback the foreign transactions. After the local
transaction has committed/aborted, all the foreign transactions on the
registered foreign connections are committed or aborted resp. using hook
HandleForeignTransaction. Failing to commit a foreign transaction does not
affect the other foreign transactions; they are still tried to be committed
(if the local transaction commits).

Is this a change from the current behavior? What if we call the first
commit handler and it throws an ERROR? Presumably then nothing else
gets committed, and the transaction overall aborts.

PITR
====
PITR may rewind the database to a point before an xid associated with an
unresolved foreign transaction. There are two approaches to deal with the
situation.
1. Just forget about the unresolved foreign transaction and remove the file
just like we do for a prepared local transaction. But then the prepared
transaction on the foreign server might be left unresolved forever and will
keep holding the resources.
2. Do not allow PITR to such point. We can not get rid of the transaction id
without getting rid of prepared foreign transaction. If we do so, we might
create conflicting files in future and might resolve the transaction with
wrong outcome.

I don't think either of these is correct. The database shouldn't
behave differently when PITR is used than when it isn't. Otherwise
you are not doing what it says on the tin: recovering to the chosen
point in time. I recommend adding a function that forgets about a
foreign prepared transaction and making it the DBA's job to figure out
whether to call it in a particular scenario. After all, the remote
machine might have been subjected to PITR, too. Or maybe not. We
can't know, so we should give the DBA the tools to clean things up and
leave it at that.

IIUC LRO, the patch uses the local transaction as last resource, which is
always present. The fate of foreign transaction is decided by the fate of
the local transaction, which is not required to be prepared per se. There
is a more relevant note later.

Personally, I think that's perfectly fine. We could do more later if
we wanted to, but there's plenty to like here without that.

Just to be clear: you also need two-phase commit if the transaction
updated anything in the local server and in even one foreign server.

Any local transaction involving a foreign server transaction uses two-phase
commit for the foreign transaction. The local transaction is not prepared
per se. However, we should be able to optimize a case, when there are no
local changes. I am not able to find a way to deduce that there was no local
change, so I have left that case in this patch. Is there a way to know
whether a local transaction changed something locally or not?

You might check whether it wrote any WAL. There's a global variable
for that somewhere; RecordTransactionCommit() uses it. But I don't
think this is an essential optimization for v1, either.
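
The global variable mentioned here is presumably XactLastRecEnd, which RecordTransactionCommit() compares against zero to decide whether the transaction wrote WAL. The sketch below uses a stand-in type and passes the value as a parameter so that it compiles outside the backend; the helper names and the participant-counting rule are assumptions made for illustration, not code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for the backend type */

/* A transaction that wrote WAL leaves a non-zero end-of-record pointer. */
bool
local_xact_wrote_wal(XLogRecPtr xact_last_rec_end)
{
	return xact_last_rec_end != 0;
}

/*
 * Two-phase commit is only needed when more than one participant has
 * changes to commit; the local server counts as a participant only if it
 * wrote WAL.
 */
bool
two_phase_needed(XLogRecPtr xact_last_rec_end, size_t n_foreign_modified)
{
	size_t		participants = n_foreign_modified;

	if (local_xact_wrote_wal(xact_last_rec_end))
		participants++;

	return participants > 1;
}
```

With this rule, a transaction that touched a single foreign server and wrote nothing locally could skip the prepare step, which is the optimization being discussed.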

I have used approach similar to pg_twophase, but implemented it as a
separate code, as the requirements differ. But, I would like to minimize
code by unifying both, if we finalise this design. Suggestions in this
regard will be very helpful.

-1 for trying to unify those unless it's really clear that it's a good
idea. I bet it's not.

Or you could insert/update the rows in the catalog with xmin=FrozenXid,
ignoring MVCC. Not sure how well that would work.

I am not aware of how to do that. Do we have any precedent in the code?

No. I bet that's also a bad idea. A non-transactional table is a
good idea that has been proposed before, but let's not try to invent
it in this patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#31

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 10 years ago

In reply to: Robert Haas (#30)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jul 17, 2015 at 10:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Overall, you seem to have made some significant progress on the design
since the last version of this patch. There's probably a lot left to
do, but the design seems more mature now. I haven't read the code,
but here are some comments based on the email.

Thanks for your comments.

I have incorporated most of your suggestions (marked as Done) in the
attached patch.

On Thu, Jul 9, 2015 at 6:18 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The patch introduces a GUC atomic_foreign_transaction, which when ON ensures
atomic commit for foreign transactions, otherwise not. The value of this GUC
at the time of committing or preparing a local transaction is used. This
gives applications the flexibility to choose the behaviour as late in the
transaction as possible. This GUC has no effect if there are no foreign
servers involved in the transaction.

Hmm. I'm not crazy about that name, but I don't have a better idea either.

One thing about this design is that it makes atomicity a property of
the transaction rather than the server. That is, any given
transaction is either atomic with respect to all servers or atomic
with respect to none. You could also design this the other way: each
server is either configured to do atomic commit, or not. When a
transaction is committed, it prepares on those servers which are
configured for it, and then commits the others. So then you can have
a "partially atomic" transaction where, for example, you transfer
money from account A to account B (using one or more FDW connections
that support atomic commit) and also use twitter_fdw to tweet about it
(using an FDW connection that does NOT support atomic commit). The
tweet will survive even if the local commit fails, but that's OK. You
could even do this at table granularity: we'll prepare the transaction
if at least one foreign table involved in the transaction has
atomic_commit = true.

In some sense I think this might be a nicer design, because suppose
you connect to a foreign server and mostly just log stuff but
occasionally do important things there. In your design, you can do
this, but you'll need to make sure atomic_foreign_transaction is set
for the correct set of transactions. But in what I'm proposing here
we might be able to derive the correct value mostly automatically.

A user may set atomic_foreign_transaction to ON to guarantee atomicity; IOW,
the transaction throws an error when atomicity cannot be guaranteed. Thus, if
an application accidentally modifies a foreign server which doesn't support
2PC, the transaction would abort. A user may consciously set it to OFF,
taking responsibility for the result, so as not to use 2PC (probably to
reduce the overhead) even if the foreign server is 2PC compliant. So, I
thought a GUC would be necessary. We can incorporate the behaviour you are
suggesting by having atomic_foreign_transaction accept three values: "full"
(the ON behaviour), "partial" (the behaviour you are describing) and "none"
(the OFF behaviour). The default value of this GUC would be "partial". Will
that be fine?
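
The three proposed settings could map onto a per-server decision roughly as sketched below. The enum names, the function name and the `server_supports_2pc` flag are hypothetical; in particular, whether "partial" should key off capability or off a per-server option is still open in this discussion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three proposed GUC values. */
typedef enum
{
	AFT_NONE,					/* never use 2PC (the OFF behaviour) */
	AFT_PARTIAL,				/* use 2PC where the server allows it */
	AFT_FULL					/* require 2PC everywhere (the ON behaviour) */
} AtomicForeignXactMode;

typedef enum
{
	ACT_ONE_PHASE,				/* plain COMMIT on the foreign server */
	ACT_TWO_PHASE,				/* PREPARE, then COMMIT PREPARED */
	ACT_ERROR					/* atomicity cannot be guaranteed: abort */
} CommitAction;

/* Decide how one foreign server should be committed under a given mode. */
CommitAction
choose_commit_action(AtomicForeignXactMode mode, bool server_supports_2pc)
{
	switch (mode)
	{
		case AFT_FULL:
			return server_supports_2pc ? ACT_TWO_PHASE : ACT_ERROR;
		case AFT_PARTIAL:
			return server_supports_2pc ? ACT_TWO_PHASE : ACT_ONE_PHASE;
		case AFT_NONE:
			return ACT_ONE_PHASE;
	}

	assert(false && "unknown atomic_foreign_transaction mode");
	return ACT_ERROR;
}
```

Under "full", a single server that cannot do 2PC makes the whole transaction fail, whereas "partial" silently downgrades only that server to one-phase commit.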

About the table-level atomic commit attribute, I agree that some foreign
tables might hold "more critical" data than others from the same server, but
I am not sure whether only that attribute should dictate the atomicity or
not. A transaction collectively might need to be "atomic" even if the
individual tables it modified do not have the atomic_commit attribute set.
So, we need a transaction-level attribute for atomicity, which may be
overridden by a table-level attribute. Should we add support for the
table-level atomicity setting in version 2+?

We should consider other possible designs as well; the choices we make
here may have a significant impact on usability.

I looked at other RDBMSes like IBM's federated database or Oracle. They
support only "full" behaviour as described above with some optimizations
like LRO. But, I would like to hear about other options.

Another GUC max_fdw_transactions sets the maximum number of transactions
that can be simultaneously prepared on all the foreign servers. This limits
the memory required for remembering the prepared foreign transactions.

How about max_prepared_foreign_transactions?

Done.

Two new FDW hooks are introduced for transaction management.
1. GetPrepareId: to get the prepared transaction identifier for a given
foreign server connection. An FDW which doesn't want to support this feature
can keep this hook undefined (NULL). When defined the hook should return a
unique identifier for the transaction prepared on the foreign server. The
identifier should be unique enough not to conflict with currently prepared
or future transactions. This point will be clear when discussing phase 2 of
2PC.

2. HandleForeignTransaction: to end a transaction in specified way. The hook
should be able to prepare/commit/rollback current running transaction on
given connection or commit/rollback a previously prepared transaction. This
is described in detail while describing phase two of two-phase commit. The
function is required to return a boolean status of whether the requested
operation was successful or not. The function or its minions should not
raise any error on failure so as not to interfere with the distributed
transaction processing. This point will be clarified more in the description
below.

HandleForeignTransaction is not very descriptive, and I think you're
jamming together things that ought to be separated. Let's have a
PrepareForeignTransaction and a ResolvePreparedForeignTransaction.

Done, there are three hooks now
1. For preparing a foreign transaction
2. For resolving a prepared foreign transaction
3. For committing/aborting a running foreign transaction (more explanation
later)
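
The three hooks listed above might be declared along these lines. Only the hook names PrepareForeignTransaction, ResolvePreparedForeignTransaction and EndForeignTransaction appear in this thread; the parameter lists, the `ForeignXactRoutine` struct and the helper below are assumptions for illustration and may not match the attached patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* stand-in for the PostgreSQL type */

/* 1. Prepare the running foreign transaction under the given identifier. */
typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
													const char *prep_id);

/* 2. Commit or roll back a previously prepared foreign transaction. */
typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
															Oid userid,
															bool is_commit,
															const char *prep_id);

/* 3. Commit or abort a running (unprepared) foreign transaction. */
typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
												bool is_commit);

/* Hypothetical container; in the patch these would live in FdwRoutine. */
typedef struct
{
	PrepareForeignTransaction_function PrepareForeignTransaction;
	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
	EndForeignTransaction_function EndForeignTransaction;
} ForeignXactRoutine;

/*
 * An FDW that leaves the prepare/resolve hooks NULL cannot take part in
 * two-phase commit.
 */
bool
fdw_supports_two_phase(const ForeignXactRoutine *routine)
{
	assert(routine != NULL);
	return routine->PrepareForeignTransaction != NULL &&
		routine->ResolvePreparedForeignTransaction != NULL;
}
```

All three hooks return a boolean status instead of raising an error, in line with the requirement stated for the earlier HandleForeignTransaction hook.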

A foreign server or user mapping corresponding to an unresolved foreign
transaction is not allowed to be dropped or altered until the foreign
transaction is resolved. This is required to retain the connection
properties which are needed to resolve the prepared transaction on the
foreign server.

I agree with not letting it be dropped, but I think not letting it be
altered is a serious mistake. Suppose the foreign server dies in a
fire, its replica is promoted, and we need to re-point the master at
the replica's hostname or IP.

Done

IP might be fine, but consider altering the dbname option or dropping it; we
won't find the prepared foreign transaction in the new database. I think we
should at least warn the user that there exists a prepared foreign
transaction on the given foreign server or user mapping; better still, we
could let the FDW decide which options are allowed to be altered when there
exists a foreign prepared transaction. The latter requires some surgery in
the way we handle the options.

Handling non-atomic foreign transactions
===============================
When atomic_foreign_transaction is disabled, one-phase commit protocol is
used to commit/rollback the foreign transactions. After the local
transaction has committed/aborted, all the foreign transactions on the
registered foreign connections are committed or aborted resp. using hook
HandleForeignTransaction. Failing to commit a foreign transaction does not
affect the other foreign transactions; they are still tried to be committed
(if the local transaction commits).

Is this a change from the current behavior?

There is no current behaviour defined per se. Each FDW is free to add its
own transaction callbacks, which can commit/rollback their respective
transactions at pre-commit time or after the commit. postgres_fdw's
callback tries to commit the foreign transactions on the PRE_COMMIT event
and throws an error if that fails.

What if we call the first
commit handler and it throws an ERROR? Presumably then nothing else
gets committed, and the transaction overall aborts.

In this case, the fate of the transaction depends upon the order in which
foreign transactions are committed, which in turn depends on the order in
which the foreign transactions were started. This can give non-deterministic
results. The patch tries to make the behaviour deterministic: commit
whatever can be committed and abort the rest. This requires the
EndForeignTransaction hook (HandleForeignTransaction in the earlier patch)
not to raise an error, although I do not know how to prevent it from
throwing one. We may try catching errors and not rethrowing them, but I
haven't tried that.
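
The "commit whatever can be committed" behaviour amounts to a loop that records failures instead of stopping at the first one. The sketch below assumes a hook that reports failure through its return value; the `EndXactHook` type, the function names and the deliberately failing example hook are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook type: returns false on failure instead of raising. */
typedef bool (*EndXactHook) (size_t server_index, bool is_commit);

/*
 * Try to end the foreign transaction on every server.  A failure on one
 * server is recorded in failed[] (if provided) but does not prevent the
 * remaining servers from being tried.  Returns the number of failures.
 */
size_t
end_all_foreign_transactions(EndXactHook end_xact, size_t nservers,
							 bool is_commit, bool *failed)
{
	size_t		nfailed = 0;
	size_t		i;

	assert(end_xact != NULL);
	for (i = 0; i < nservers; i++)
	{
		bool		ok = end_xact(i, is_commit);

		if (failed != NULL)
			failed[i] = !ok;
		if (!ok)
			nfailed++;
	}
	return nfailed;
}

/* Example hook for illustration: the second server always fails. */
bool
example_hook_second_server_fails(size_t server_index, bool is_commit)
{
	(void) is_commit;
	return server_index != 1;
}
```

Because the loop never exits early, the outcome no longer depends on the order in which the foreign transactions were started.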

The same requirement goes with ResolvePreparedForeignTransaction(). If that
hook throws an error, we end up with unresolved prepared transactions,
which will be committed only when the resolver kicks in.

PITR
====
PITR may rewind the database to a point before an xid associated with an
unresolved foreign transaction. There are two approaches to deal with the
situation.
1. Just forget about the unresolved foreign transaction and remove the file
just like we do for a prepared local transaction. But then the prepared
transaction on the foreign server might be left unresolved forever and will
keep holding the resources.
2. Do not allow PITR to such point. We can not get rid of the transaction id
without getting rid of prepared foreign transaction. If we do so, we might
create conflicting files in future and might resolve the transaction with
wrong outcome.

I don't think either of these is correct. The database shouldn't
behave differently when PITR is used than when it isn't. Otherwise
you are not doing what it says on the tin: recovering to the chosen
point in time. I recommend adding a function that forgets about a
foreign prepared transaction and making it the DBA's job to figure out
whether to call it in a particular scenario. After all, the remote
machine might have been subjected to PITR, too. Or maybe not. We
can't know, so we should give the DBA the tools to clean things up and
leave it at that.

I have added a built-in function pg_fdw_remove() (or any suitable name),
which removes the prepared foreign transaction entry from memory and disk.
The function needs to be called before attempting PITR. If the recovery
targets a past time without the file having been removed, we abort the
recovery. In that case, a DBA can remove the foreign prepared transaction
file manually before recovery. I have added a hint to that effect in the
error message. Is that enough?

I noticed that the functions pg_fdw_resolve() and pg_fdw_remove(), which
resolve or remove unresolved prepared foreign transactions resp., make
changes which cannot be rolled back if the transaction which ran these
functions rolls back. These need to be converted into SQL commands, like
ROLLBACK PREPARED, which can't be run within a transaction block.

IIUC LRO, the patch uses the local transaction as last resource, which is
always present. The fate of foreign transaction is decided by the fate of
the local transaction, which is not required to be prepared per se. There
is a more relevant note later.

Personally, I think that's perfectly fine. We could do more later if
we wanted to, but there's plenty to like here without that.

Agreed.

Just to be clear: you also need two-phase commit if the transaction
updated anything in the local server and in even one foreign server.

Any local transaction involving a foreign server transaction uses two-phase
commit for the foreign transaction. The local transaction is not prepared
per se. However, we should be able to optimize a case, when there are no
local changes. I am not able to find a way to deduce that there was no local
change, so I have left that case in this patch. Is there a way to know
whether a local transaction changed something locally or not?

You might check whether it wrote any WAL. There's a global variable
for that somewhere; RecordTransactionCommit() uses it. But I don't
think this is an essential optimization for v1, either.

Agreed.

I have used approach similar to pg_twophase, but implemented it as a
separate code, as the requirements differ. But, I would like to minimize
code by unifying both, if we finalise this design. Suggestions in this
regard will be very helpful.

-1 for trying to unify those unless it's really clear that it's a good
idea. I bet it's not.

Fine.

Or you could insert/update the rows in the catalog with xmin=FrozenXid,
ignoring MVCC. Not sure how well that would work.

I am not aware of how to do that. Do we have any precedent in the code?

No. I bet that's also a bad idea. A non-transactional table is a
good idea that has been proposed before, but let's not try to invent
it in this patch.

Agreed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..6f587ae
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,364 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How frequently the resolver daemon checks for unresolved transactions */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL; 
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+			
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker, check whether it has completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid); 
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/* 
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we only need a
+	 * handler for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+	
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction in which we can call the resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..8b119a5 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -8,20 +8,22 @@
  * IDENTIFICATION
  *		  contrib/postgres_fdw/connection.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 
 
 /*
  * Connection cache hash table entry
  *
  * The lookup key in this hash table is the foreign server OID plus the user
@@ -57,52 +59,59 @@ typedef struct ConnCacheEntry
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+									bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool is_server_twophase_compliant(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
  * server with the user's authorization.  A new connection is established
  * if we don't already have a suitable one, and a transaction is opened at
  * the right subtransaction nesting depth if we didn't do that already.
  *
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok, if true, indicates that the caller can handle a
+ * connection error by itself, so NULL is returned. If false, raise an error.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
  * syscaches.  For the moment, though, it's not clear that this would really
  * be useful and not mere pedantry.  We could not flush any active connections
  * mid-transaction anyway.
  */
 PGconn *
 GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt)
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
 		HASHCTL		ctl;
 
@@ -116,23 +125,20 @@ GetConnection(ForeignServer *server, UserMapping *user,
 									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
 		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key.serverid = server->serverid;
 	key.userid = user->userid;
 
 	/*
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
@@ -152,41 +158,64 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * get here unless the caller has asked to handle the failure itself.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
+
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, server);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
+
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
 	return entry->conn;
 }
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails and the caller can handle connection failure
+ * (connection_error_ok = true), return NULL; otherwise throw an error.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+					bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
 	/*
 	 * Use PG_TRY block to ensure closing connection on error.
 	 */
 	PG_TRY();
 	{
 		const char **keywords;
 		const char **values;
@@ -227,25 +256,29 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 		{
 			char	   *connmessage;
 			int			msglen;
 
 			/* libpq typically appends a newline, strip that */
 			connmessage = pstrdup(PQerrorMessage(conn));
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;	/* XXX returns from inside PG_TRY without PQfinish(conn) */
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"",
+					   		server->servername),
+						errdetail_internal("%s", connmessage)));
 		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
 		 * otherwise, he's piggybacking on the postgres server's user
 		 * identity. See also dblink_security_check() in contrib/dblink.
 		 */
 		if (!superuser() && !PQconnectionUsedPassword(conn))
 			ereport(ERROR,
 				  (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
@@ -362,29 +395,36 @@ do_sql_command(PGconn *conn, const char *sql)
  * Start remote transaction or subtransaction, if needed.
  *
  * Note that we always use at least REPEATABLE READ in the remote session.
  * This is so that, if a query initiates multiple scans of the same or
  * different foreign tables, we will get snapshot-consistent results from
  * those scans.  A disadvantage is that we can't provide sane emulation of
  * READ COMMITTED behavior --- it would be nice if we had some other way to
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the foreign server for this transaction, recording whether
+		 * it is configured as two-phase compliant.
+		 */
+		RegisterXactForeignServer(entry->key.serverid, entry->key.userid,
+									is_server_twophase_compliant(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
 		if (IsolationIsSerializable())
 			sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
 		else
 			sql = "START TRANSACTION ISOLATION LEVEL REPEATABLE READ";
 		do_sql_command(entry->conn, sql);
 		entry->xact_depth = 1;
 	}
@@ -506,148 +546,295 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * Crafts a prepared transaction identifier. The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It must be a string literal, less than 200 bytes long.
+ * 2. It must not be the same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * a UUID, which gives unique ids with high probability, but that may be
+ * expensive here, and the UUID extension which provides the function to
+ * generate UUIDs is not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%u_%u",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The reported length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Connection hash should have a connection we want */
+		
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+
+		PGconn	*conn = entry->conn;
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Connection hash should have a connection we want */
+		
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
+		
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key.serverid = serverid;
+		key.userid = userid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+	if (!conn && IsTransactionState())
+	{
+		ForeignServer	*foreign_server = GetForeignServer(serverid); 
+		UserMapping		*user_mapping = GetUserMapping(userid, serverid);
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+		conn = GetConnection(foreign_server, user_mapping, false, false, true);
+	}
 
-			switch (event)
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+	
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+	
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Either way, it should be considered resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.  We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to the end-of-transaction
+	 * callback.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
 	/* Also reset cursor numbering for next transaction */
 	cursor_number = 0;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
@@ -708,10 +895,33 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 			if (PQresultStatus(res) != PGRES_COMMAND_OK)
 				pgfdw_report_error(WARNING, res, entry->conn, true, sql);
 			else
 				PQclear(res);
 		}
 
 		/* OK, we're outta that level of subtransaction */
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * is_server_twophase_compliant
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+is_server_twophase_compliant(ForeignServer *server)
+{
+	ListCell		*lc;
+	
+	/* Check the options for two phase compliance */ 
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "twophase_compliant") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
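The identifier crafted by postgresGetPrepareId() has to fit the prepared-transaction id limit of less than 200 bytes while still naming the originating server and user. Below is a minimal standalone sketch of that formatting step, with truncation reported to the caller. The function craft_prep_xact_id and its nonce parameter (standing in for the random component) are hypothetical, and Oids are represented as plain unsigned int so the block compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Buffer size for an id; the id itself must be shorter than 200 bytes. */
#define PREP_XACT_ID_MAX_LEN 200

/*
 * Hypothetical helper: format "px_<nonce>_<serverid>_<userid>" into buf.
 * Returns the id length (excluding the terminating NUL byte), or -1 if the
 * id did not fit into buflen bytes.
 */
static int
craft_prep_xact_id(char *buf, size_t buflen, unsigned int nonce,
				   unsigned int serverid, unsigned int userid)
{
	int			n;

	assert(buf != NULL);

	/* %u is used throughout because all three components are unsigned */
	n = snprintf(buf, buflen, "px_%u_%u_%u", nonce, serverid, userid);
	if (n < 0 || (size_t) n >= buflen)
		return -1;				/* output error or truncated id */
	return n;
}
```

A caller would pass a buffer of PREP_XACT_ID_MAX_LEN bytes and treat a return value of -1 as a failure to craft a usable id. How the nonce is chosen is left open here, as it is in the discussion above.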
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1f417b3..880dfba 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -1,20 +1,21 @@
 -- ===================================================================
 -- create FDW objects
 -- ===================================================================
 CREATE EXTENSION postgres_fdw;
 CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
 DO $d$
     BEGIN
         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (dbname '$$||current_database()||$$',
-                     port '$$||current_setting('port')||$$'
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
             )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
 CREATE TYPE user_enum AS ENUM ('foo', 'bar', 'buz');
@@ -3634,10 +3635,433 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- tests with non-atomic foreign transactions (default)
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+(3 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+(4 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+(4 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+   6
+(5 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+   6
+(5 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+   4
+   6
+   7
+   8
+(7 rows)
+
+-- refill the local table for further tests.
+TRUNCATE TABLE lt;
+-- test with atomic foreign transactions
+SET atomic_foreign_transaction TO ON;
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_with_fdw           | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_with_fdw           | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     2
+(1 row)
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+ pg_fdw_remove 
+---------------
+ 
+(1 row)
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     0
+(1 row)
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+ database | foreign server | local user | status 
+----------+----------------+------------+--------
+(0 rows)
+
+SELECT database, gid FROM pg_prepared_xacts;
+ database | gid 
+----------+-----
+(0 rows)
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_fdw_subxact        | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_fdw_subxact        | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- The transaction in this test doesn't violate any constraints.
+TRUNCATE TABLE lt;
+ALTER SERVER loopback2 OPTIONS (SET twophase_compliant 'false'); 
+-- test with and without atomic foreign transaction
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+ERROR:  atomicity can not be guaranteed because some foreign server/s involved in transaction can not participate in two phase commit.
+SELECT * FROM lt;
+ val 
+-----
+(0 rows)
+
+SET atomic_foreign_transaction TO OFF;
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   1
+   2
+   3
+(3 rows)
+
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..ed956ab 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -98,21 +98,22 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname),
 					 errhint("Valid options in this context are: %s",
 							 buf.data)));
 		}
 
 		/*
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "twophase_compliant") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
 		}
 		else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 				 strcmp(def->defname, "fdw_tuple_cost") == 0)
 		{
 			/* these must have a non-negative numeric value */
 			double		val;
 			char	   *endp;
@@ -146,20 +147,22 @@ InitPgFdwOptions(void)
 		{"column_name", AttributeRelationId, false},
 		/* use_remote_estimate is available on both server and table */
 		{"use_remote_estimate", ForeignServerRelationId, false},
 		{"use_remote_estimate", ForeignTableRelationId, false},
 		/* cost factors */
 		{"fdw_startup_cost", ForeignServerRelationId, false},
 		{"fdw_tuple_cost", ForeignServerRelationId, false},
 		/* updatable is available on both server and table */
 		{"updatable", ForeignServerRelationId, false},
 		{"updatable", ForeignTableRelationId, false},
+		/* 2PC compatibility */
+		{"twophase_compliant", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
 	/* Prevent redundant initialization. */
 	if (postgres_fdw_options)
 		return;
 
 	/*
 	 * Get list of valid libpq options.
 	 *
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e4d799c..f574543 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -9,20 +9,22 @@
  *		  contrib/postgres_fdw/postgres_fdw.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/xact.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -362,20 +364,26 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for EXPLAIN */
 	routine->ExplainForeignScan = postgresExplainForeignScan;
 	routine->ExplainForeignModify = postgresExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	routine->AnalyzeForeignTable = postgresAnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	PG_RETURN_POINTER(routine);
 }
 
 /*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
  * We should consider the effect of all baserestrictinfo clauses here, but
  * not any join clauses.
  */
@@ -918,21 +926,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	/* Get info about foreign table. */
 	fsstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(fsstate->rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(server, user, false);
+	fsstate->conn = GetConnection(server, user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
 
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
 	fsstate->retrieved_attrs = (List *) list_nth(fsplan->fdw_private,
 											   FdwScanPrivateRetrievedAttrs);
@@ -1316,21 +1324,21 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	 */
 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
 
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(server, user, true);
+	fmstate->conn = GetConnection(server, user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
 	fmstate->query = strVal(list_nth(fdw_private,
 									 FdwModifyPrivateUpdateSql));
 	fmstate->target_attrs = (List *) list_nth(fdw_private,
 											  FdwModifyPrivateTargetAttnums);
 	fmstate->has_returning = intVal(list_nth(fdw_private,
 											 FdwModifyPrivateHasReturning));
 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
@@ -1766,21 +1774,21 @@ estimate_path_cost_size(PlannerInfo *root,
 		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
 						 &retrieved_attrs);
 		if (fpinfo->remote_conds)
 			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
 							  true, NULL);
 		if (remote_join_conds)
 			appendWhereClause(&sql, root, baserel, remote_join_conds,
 							  (fpinfo->remote_conds == NIL), NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		conn = GetConnection(fpinfo->server, fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
 
 		retrieved_rows = rows;
 
 		/* Factor in the selectivity of the locally-checked quals */
 		local_sel = clauselist_selectivity(root,
 										   local_join_conds,
 										   baserel->relid,
@@ -2330,21 +2338,21 @@ postgresAnalyzeForeignTable(Relation relation,
 	 * it's probably not worth redefining that API at this point.
 	 */
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
 	 */
 	initStringInfo(&sql);
 	deparseAnalyzeSizeSql(&sql, relation);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
 	{
@@ -2422,21 +2430,21 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 											ALLOCSET_SMALL_INITSIZE,
 											ALLOCSET_SMALL_MAXSIZE);
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
 	 */
 	cursor_number = GetCursorNumber(conn);
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
 
 	/* In what follows, do not risk leaking any PGresults. */
@@ -2623,21 +2631,21 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname)));
 	}
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(server, mapping, false);
+	conn = GetConnection(server, mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
 		import_collate = false;
 
 	/* Create workspace for strings */
 	initStringInfo(&buf);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..8d24359 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -10,30 +10,32 @@
  *
  *-------------------------------------------------------------------------
  */
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
 #include "utils/relcache.h"
 
 #include "libpq-fe.h"
 
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt);
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
 						 const char **keywords,
 						 const char **values);
@@ -67,12 +69,19 @@ extern void deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 				 List *targetAttrs, List *returningList,
 				 List **retrieved_attrs);
 extern void deparseDeleteSql(StringInfo buf, PlannerInfo *root,
 				 Index rtindex, Relation rel,
 				 List *returningList,
 				 List **retrieved_attrs);
 extern void deparseAnalyzeSizeSql(StringInfo buf, Relation rel);
 extern void deparseAnalyzeSql(StringInfo buf, Relation rel,
 				  List **retrieved_attrs);
 extern void deparseStringLiteral(StringInfo buf, const char *val);
+extern char *postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info);
 
 #endif   /* POSTGRES_FDW_H */
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index fcdd92e..7045e55 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -2,21 +2,22 @@
 -- create FDW objects
 -- ===================================================================
 
 CREATE EXTENSION postgres_fdw;
 
 CREATE SERVER testserver1 FOREIGN DATA WRAPPER postgres_fdw;
 DO $d$
     BEGIN
         EXECUTE $$CREATE SERVER loopback FOREIGN DATA WRAPPER postgres_fdw
             OPTIONS (dbname '$$||current_database()||$$',
-                     port '$$||current_setting('port')||$$'
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
             )$$;
     END;
 $d$;
 
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -827,10 +828,278 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 twophase_compliant 'true'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+INSERT INTO lt VALUES (1);
+INSERT INTO lt VALUES (3);
+
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- tests with non-atomic foreign transactions (default)
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- refill the local table for further tests.
+TRUNCATE TABLE lt;
+
+-- test with atomic foreign transactions
+SET atomic_foreign_transaction TO ON;
+
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+SELECT database, gid FROM pg_prepared_xacts;
+
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (2);
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- The transaction in this test doesn't violate any constraints.
+TRUNCATE TABLE lt;
+
+ALTER SERVER loopback2 OPTIONS (SET twophase_compliant 'false'); 
+-- test with and without atomic foreign transaction
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+SET atomic_foreign_transaction TO OFF;
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1);
+	INSERT INTO ft1_lt VALUES (2);
+	INSERT INTO ft2_lt VALUES (3);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b91d6c7..48cf2e2 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1440,20 +1440,62 @@ include_dir 'conf.d'
        </para>
 
        <para>
         When running a standby server, you must set this parameter to the
         same or higher value than on the master server. Otherwise, queries
         will not be allowed in the standby server.
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions cannot be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled, a transaction involving one or more foreign
+       servers is guaranteed to commit either all or none of its changes on those servers.
+       The parameter can be set at any time during the session. The value in effect
+       when the transaction commits is the one that is used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>work_mem</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
         Specifies the amount of memory to be used by internal sort operations
         and hash tables before writing to temporary disk files. The value
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 2361577..9b931e4 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -918,20 +918,85 @@ ImportForeignSchema (ImportForeignSchemaStmt *stmt, Oid serverOid);
      useful to test whether a given foreign-table name will pass the filter.
     </para>
 
     <para>
      If the FDW does not support importing table definitions, the
      <function>ImportForeignSchema</> pointer can be set to <literal>NULL</>.
     </para>
 
    </sect2>
 
+   <sect2 id="fdw-callbacks-transactions">
+    <title>FDW Routines for Transaction Management</title>
+
+    <para>
+<programlisting>
+char *
+GetPrepareInfo (Oid serverOid, Oid userid, int *prep_info_len);
+</programlisting>
+
+     Get the prepared transaction identifier for the given foreign server and
+     user. This function is called when executing <xref linkend="sql-commit">, if
+     <literal>atomic_foreign_transaction</> is enabled. It should return a
+     valid transaction identifier that will be used to prepare the transaction
+     on the foreign server. <parameter>prep_info_len</> should be set
+     to the length of this identifier. The identifier should not be longer
+     than 256 bytes, and it should not conflict with existing
+     identifiers on the foreign server. It should also be unique enough that
+     no future transaction reuses it. It is possible for a transaction to be
+     considered unresolved on <productname>PostgreSQL</> while it has in fact
+     been resolved. The foreign transaction resolver then keeps trying to resolve
+     the transaction until it finds out that it has already been resolved.
+     If the identifier is the same as that of a later transaction,
+     that later transaction could be resolved
+     erroneously.
+    </para>
+
+    <para>
+     If a foreign server whose Foreign Data Wrapper has a <literal>NULL</>
+      <function>GetPrepareInfo</> pointer participates in a transaction
+      with <literal>atomic_foreign_transaction</> enabled, the transaction
+      is aborted.
+    </para>
+
+    <para>
+<programlisting>
+bool
+HandleForeignTransaction (Oid serverOid, Oid userid, FDWXactAction action,
+                            int prep_id_len, char *prep_id);
+</programlisting>
+
+     Function to end a transaction on the given foreign server with given user.
+     This function is called when executing <xref linkend="sql-commit"> or
+     <xref linkend="sql-rollback">. The function should complete a transaction
+     according to the <parameter>action</> specified. The function should
+     return TRUE on successful completion of transaction and FALSE otherwise.
+     It should not throw an error in case of failure to complete the transaction.
+    </para>
+
+    <para>
+    When <parameter>action</> is FDW_XACT_COMMIT or FDW_XACT_ABORT, the function
+    should commit or roll back the running transaction, respectively. When <parameter>action</>
+    is FDW_XACT_PREPARE, the running transaction should be prepared with the
+    identifier given by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    When <parameter>action</> is FDW_XACT_COMMIT_PREPARED or FDW_XACT_ABORT_PREPARED,
+    the function should respectively commit or roll back the transaction identified
+    by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    </para>
+
+    <para>
+    This function is usually called at the end of the transaction, when
+    access to the database may not be possible. Trying to access catalogs
+    in this function may cause an error to be thrown and can affect other foreign
+    data wrappers.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
     <title>Foreign Data Wrapper Helper Functions</title>
 
     <para>
      Several helper functions are exported from the core server so that
      authors of foreign data wrappers can get easy access to attributes of
      FDW-related objects, such as FDW options.
      To use any of these functions, you need to include the header file
@@ -1308,11 +1373,93 @@ GetForeignServerByName(const char *name, bool missing_ok);
     <para>
      See <filename>src/include/nodes/lockoptions.h</>, the comments
      for <type>RowMarkType</> and <type>PlanRowMark</>
      in <filename>src/include/nodes/plannodes.h</>, and the comments for
      <type>ExecRowMark</> in <filename>src/include/nodes/execnodes.h</> for
      additional information.
     </para>
 
   </sect1>
 
+   <sect1 id="fdw-transactions">
+    <title>Transaction Manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts a
+    transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it is required to register that foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to it,
+    using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant);
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    the two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    Here ft1 and ft2 are foreign tables on different foreign servers, which may be
+    using different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    an atomic distributed transaction. All the registered foreign servers must
+    support the two-phase commit protocol. In this protocol the commit
+    is processed in two phases: a prepare phase and a commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any of the foreign servers fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed to prepare
+    transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareInfo</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_PREPARE.
+    </para>
+    
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity cannot be
+    guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..8c1afcf 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -1,16 +1,16 @@
 #
 # Makefile for the rmgr descriptor routines
 #
 # src/backend/access/rmgrdesc/Makefile
 #
 
 subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o gindesc.o \
+	   gistdesc.o hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
 	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..0f0c899
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module describes the WAL records for foreign transaction manager. 
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 4f29136..ad07038 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -104,28 +104,29 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 			if (entry->val == xlrec.wal_level)
 			{
 				wal_level_str = entry->name;
 				break;
 			}
 		}
 
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamps=%s",
+						 "track_commit_timestamps=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
 		memcpy(&fpw, rec, sizeof(bool));
 		appendStringInfoString(buf, fpw ? "true" : "false");
 	}
 	else if (info == XLOG_END_OF_RECOVERY)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 94455b2..51b2efd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -8,16 +8,17 @@
 #
 #-------------------------------------------------------------------------
 
 subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o commit_ts.o multixact.o parallel.o rmgr.o slru.o subtrans.o \
 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o \
+	fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
 # ensure that version checks in xlog.c get recompiled when catversion.h changes
 xlog.o: xlog.c $(top_srcdir)/src/include/catalog/catversion.h
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..4ea0482
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2004 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module manages the transactions involving foreign servers. 
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * At the time of commit, the GUC atomic_foreign_transaction controls whether we
+ * attempt to commit foreign transactions atomically along with the local
+ * transaction or not.
+ *
+ * If atomic_foreign_transaction is enabled, the two-phase commit protocol is used
+ * to achieve atomicity. In the first phase transactions are prepared on all
+ * participating foreign servers. If the first phase succeeds, foreign servers are
+ * requested to commit their respective prepared transactions. If the first phase
+ * does not succeed because of any failure, the foreign servers are asked to
+ * roll back their respective prepared transactions, or to abort the transactions
+ * if they are not prepared. Any network failure or server crash after preparing a
+ * foreign transaction leaves that prepared transaction unresolved. During the first
+ * phase, before actually preparing the transactions, enough information is
+ * persisted to the disk and logs in order to resolve such transactions. The
+ * first phase is executed during pre-commit processing. The second phase is
+ * executed during post-commit processing or when processing an aborted
+ * transaction.
+ *
+ * If atomic_foreign_transaction is disabled, a one-phase commit protocol is
+ * used. During post-commit processing or when processing an aborted transaction,
+ * foreign servers are respectively asked to commit or roll back their
+ * transactions. A failure in executing this step on any single foreign server
+ * does not affect the other foreign servers. Thus if the local transaction
+ * commits, it is not guaranteed that all the participating foreign servers
+ * commit their respective transactions.
+ */
+
+/* GUC which controls atomicity of transactions involving foreign servers */
+bool	atomic_foreign_xact = false;
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */ 
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreing_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction are two phase commit compliant.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_compliant)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine 		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_compliant;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u with user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(fdw_conn->serverid); 
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_compliant)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	/*
+	 * We may need the following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreing_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */ 
+	Oid				serveroid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */ 
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is the xid, foreign server oid and
+ * user oid, each printed as 8 hexadecimal digits, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serveroid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serveroid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid,
+								Oid userid, int fdw_xact_info_len,
+								char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serveroid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid,
+								bool giveWarning);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * This is a GUC variable; changing it requires a restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */ 
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * If user has requested atomicity for transactions involved
+ * (atomic_foreign_transaction GUC enabled), this
+ * function executes the first phase of 2PC.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/* Non-atomic foreign transactions, nothing to do here */
+	if (!atomic_foreign_xact)
+		return;
+	/*
+	 * If user expects the transactions involving foreign servers to be atomic,
+	 * make sure that every foreign server can participate in two phase commit.
+	 * Checking this GUC value at the time of COMMIT allows user to set it
+	 * during the transaction depending upon the foreign servers involved.
+	 */
+	if (atomic_foreign_xact && !TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("atomicity can not be guaranteed because some foreign server/s involved in transaction can not participate in two phase commit.")));
+
+	prepare_foreign_transactions();
+}
+
+/*
+ * Prepare transactions on all the connections to the foreign servers accessed
+ * by this transaction.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/* 
+	 * Loop over the foreign connections 
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+															fdw_conn->userid,
+															&fdw_xact_info_len);
+		
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * disk and to WAL (thereby relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have a prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 * 
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+											fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info);
+		/*
+		 * Between the register_fdw_xact call above and the time this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it
+		 * will send an ABORT message to the foreign server. If the foreign server
+		 * has not prepared the transaction, the message will succeed. If the
+		 * foreign server has prepared the transaction, it will throw an error,
+		 * which we will ignore, and the prepared foreign transaction will be
+		 * resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact; 
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serveroid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are the same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid; 
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serveroid = fdw_xact->serveroid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serveroid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */ 
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.  Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.  The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */ 
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact 
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id,
+					FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serveroid == serveroid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serveroid %u, userid %u found",
+						xid, serveroid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serveroid = serveroid;
+	fdw_xact->userid = userid;
+	fdw_xact->fdw_xact_status = fdw_xact_status; 
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */ 
+			fdw_remove_xlog.serveroid = fdw_xact->serveroid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serveroid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in-between those two, we will be left with an
+	 * entry locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/* 
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *    according to the result of transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *    succeeds, remove them from foreign transaction entries, otherwise unlock
+ *    them.
+ */ 
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+	/*
+	 * For non-atomic foreign transactions, commit/abort the transactions on
+	 * foreign server/s. For atomic foreign transactions, commit/abort the
+	 * prepared transactions.
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			/*
+			 * We should be preparing foreign transactions only if we are in
+			 * atomic foreign transaction mode.
+			 */
+			Assert(atomic_foreign_xact);
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * If we are in atomic foreign transaction mode and committing the
+			 * transaction, we should have prepared all the transactions. Only
+			 * when the transaction is aborted while the foreign transactions
+			 * are being prepared, we will end up here in atomic foreign
+			 * transaction mode.
+			 */
+			Assert(!atomic_foreign_xact || !is_commit);
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	if (!atomic_foreign_xact)
+	{
+		ereport(ERROR,
+			(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+			 errmsg("atomic_foreign_xact should be enabled to prepare a transaction involving foreign servers.")));
+	}
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/* Free the list of registered foreign servers */
+	MyFDWConnections = NIL;
+}
+
+/* 
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+		ForeignServer		*foreign_server;
+		ForeignDataWrapper	*fdw;
+		FdwRoutine			*fdw_routine;
+
+		foreign_server = GetForeignServer(fdw_xact->serveroid); 
+		fdw = GetForeignDataWrapper(foreign_server->fdwid);
+		fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+					fdw->fdwname);
+		return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool 
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED) ?
+							true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serveroid, fdw_xact->userid,
+								is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+	
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is a wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid     | dbid    | serverid | userid  | search for entry with
+ * invalid | invalid | invalid  | invalid | nothing
+ * invalid | invalid | invalid  | valid   | given userid
+ * invalid | invalid | valid    | invalid | given serverid
+ * invalid | invalid | valid    | valid   | given serverid and userid
+ * invalid | valid   | invalid  | invalid | given dbid
+ * invalid | valid   | invalid  | valid   | given dbid and userid
+ * invalid | valid   | valid    | invalid | given dbid and serverid
+ * invalid | valid   | valid    | valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid  | invalid | given xid
+ * valid   | invalid | invalid  | valid   | given xid and userid
+ * valid   | invalid | valid    | invalid | given xid, serverid
+ * valid   | invalid | valid    | valid   | given xid, serverid, userid
+ * valid   | valid   | invalid  | invalid | given xid and dbid 
+ * valid   | valid   | invalid  | valid   | given xid, dbid and userid
+ * valid   | valid   | valid    | invalid | given xid, dbid, serverid
+ * valid   | valid   | valid    | valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria is void (all arguments invalid), every entry matches, so
+ * the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked is ignored. If
+ * all the qualifying entries are locked, nothing will be returned in the list
+ * but returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+		
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serveroid)
+			entry_matches = false;
+		
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+	
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+		
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	  		*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+		
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serveroid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+	
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+	
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file: %m")));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+		
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serveroid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints.
+ * The foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might have
+ * fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serveroids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serveroids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serveroids[nxacts] = fdw_xact->serveroid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serveroids[cnt], userids[cnt]);
+			
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serveroids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serveroids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xact.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+	
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+	
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+	
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+			
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+	
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction, so there is nothing more to decide here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is not
+				 * running on any backend. It probably crashed before the actual
+				 * abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+	
+	
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+	
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering to a point before the transaction id which created
+ *    the prepared foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts four arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									PG_GETARG_TRANSACTIONID(XID_ARGNUM);
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+	
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serveroid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serveroid != serveroid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* fd was already closed above; do not close it again */
+		pfree(buf);
+		return NULL;
+	}
+	
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *    conflicting file names.
+ *
+ * While it's possible to avoid 1 by recording the transaction status in the
+ * file, 2 looks unavoidable.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function throws error, unlike PrescanPreparedTransactions. While it
+ * suffices to remove a two-phase file, it is not the case with
+ * foreign prepared transaction file, which merely points to a prepared
+ * transaction on the foreign server. Removing such file would make PostgreSQL
+ * forget the prepared foreign transaction, which might remain unresolved
+ * forever.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			/*
+			 * Throw error if the transaction which prepared this foreign
+			 * transaction is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+				ereport(ERROR,
+					  (errmsg("a future foreign prepared transaction file \"%s\" found",
+							  clde->d_name),
+					   errhint("To continue, remove the file. The corresponding foreign prepared transaction must be resolved manually if that has not already been done.")));
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+	
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serveroid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serveroid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										serveroid, userid,
+										fdw_xact_file_data->fdw_xact_id_len,
+										fdw_xact_file_data->fdw_xact_id,
+										FDW_XACT_PREPARING);
+			/* No real LSN is known after recovery; 0 keeps the entry eligible for checkpoint sync */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+	
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..cdbc583 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -7,20 +7,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/brin_xlog.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 177d1e1..5c9aec7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -34,20 +34,21 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <time.h>
 #include <unistd.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
@@ -1469,20 +1470,26 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 		RelationCacheInitFilePostInvalidate();
 
 	/* And now do the callbacks */
 	if (isCommit)
 		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
 	else
 		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
 	/*
 	 * And now we can clean up our mess.
 	 */
 	RemoveTwoPhaseFile(xid, true);
 
 	RemoveGXact(gxact);
 	MyLockedGxact = NULL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b53d95f..aaa0edc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -14,20 +14,21 @@
  *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
 #include <time.h>
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -179,20 +180,24 @@ typedef struct TransactionStateData
 	TransactionId *childXids;	/* subcommitted child XIDs, in XID order */
 	int			nChildXids;		/* # of subcommitted child XIDs */
 	int			maxChildXids;	/* allocated size of childXids[] */
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
 	bool		startedInRecovery;		/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction;
+										   only valid for top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
 
 /*
  * CurrentTransactionState always points to the current transaction state
  * block.  It will point to TopTransactionStateData when not in a
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
@@ -1884,20 +1889,23 @@ StartTransaction(void)
 	/* SecurityRestrictionContext should never be set outside a transaction */
 	Assert(s->prevSecContext == 0);
 
 	/*
 	 * initialize other subsystems for new transaction
 	 */
 	AtStart_GUC();
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
 	 */
 	s->state = TRANS_INPROGRESS;
 
 	ShowTransactionState("StartTransaction");
 }
 
 
@@ -1940,20 +1948,23 @@ CommitTransaction(void)
 
 		/*
 		 * Close open portals (converting holdable ones into static portals).
 		 * If there weren't any, we are done ... otherwise loop back to check
 		 * if they queued deferred triggers.  Lather, rinse, repeat.
 		 */
 		if (!PreCommit_Portals(false))
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
 	/*
 	 * The remaining actions cannot call any user-defined code, so it's safe
 	 * to start shutting down within-transaction services.  But note that most
 	 * of this stuff could still throw an error, which would switch us into
 	 * the transaction-abort path.
 	 */
 
@@ -2099,20 +2110,21 @@ CommitTransaction(void)
 	AtEOXact_GUC(true, 1);
 	AtEOXact_SPI(true);
 	AtEOXact_on_commit_actions(true);
 	AtEOXact_Namespace(true, is_parallel_worker);
 	AtEOXact_SMgr();
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
 	ResourceOwnerDelete(TopTransactionResourceOwner);
 	s->curTransactionOwner = NULL;
 	CurTransactionResourceOwner = NULL;
 	TopTransactionResourceOwner = NULL;
 
 	AtCommit_Memory();
 
@@ -2283,20 +2295,21 @@ PrepareTransaction(void)
 	 * before or after releasing the transaction's locks.
 	 */
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
 	AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
 	 *
 	 * We have to record transaction prepares even if we didn't make any
 	 * updates, because the transaction manager might get confused if we lose
 	 * a global transaction.
 	 */
 	EndPrepare(gxact);
 
@@ -2565,20 +2578,21 @@ AbortTransaction(void)
 
 		AtEOXact_GUC(false, 1);
 		AtEOXact_SPI(false);
 		AtEOXact_on_commit_actions(false);
 		AtEOXact_Namespace(false, is_parallel_worker);
 		AtEOXact_SMgr();
 		AtEOXact_Files();
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
 	RESUME_INTERRUPTS();
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1dd31b3..120d897 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -16,20 +16,21 @@
 
 #include <ctype.h>
 #include <time.h>
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -4847,20 +4848,21 @@ BootStrapXLOG(void)
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
 	WriteControlFile();
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5852,20 +5854,23 @@ CheckRequiredParameterValues(void)
 									 ControlFile->max_worker_processes);
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
 		RecoveryRequiresBoolParameter("track_commit_timestamp",
 									  track_commit_timestamp,
 									  ControlFile->track_commit_timestamp);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
 StartupXLOG(void)
 {
 	XLogCtlInsert *Insert;
@@ -6508,21 +6516,24 @@ StartupXLOG(void)
 		{
 			TransactionId *xids;
 			int			nxids;
 
 			ereport(DEBUG1,
 					(errmsg("initializing for hot standby")));
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
 
 			/* Tell procarray about the range of xids it has to deal with */
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
 			 * Startup commit log, commit timestamp and subtrans only.
 			 * MultiXact has already been started up and other SLRUs are not
@@ -7108,20 +7119,21 @@ StartupXLOG(void)
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
 	XLogCtl->LogwrtResult = LogwrtResult;
 
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7305,20 +7317,26 @@ StartupXLOG(void)
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
 	TrimCLOG();
 	TrimMultiXact();
 
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
 	if (readFile >= 0)
 	{
 		close(readFile);
@@ -8579,20 +8597,25 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointMultiXact();
 	CheckPointPredicate();
 	CheckPointRelationMap();
 	CheckPointReplicationSlots();
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
  * Save a checkpoint for recovery restart if appropriate
  *
  * This function is called each time a checkpoint record is read from XLOG.
  * It must determine whether the checkpoint represents a safe restartpoint or
  * not.  If so, the checkpoint record is stashed in shared memory so that
  * CreateRestartPoint can consult it.  (Note that the latter function is
  * executed by the checkpointer, while this one will be executed by the
@@ -9004,56 +9027,59 @@ XLogRestorePoint(const char *rpName)
  */
 static void
 XLogReportParameters(void)
 {
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
 		 * if archiving is not enabled, as you can't start archive recovery
 		 * with wal_level=minimal anyway. We don't really care about the
 		 * values in pg_control either if wal_level=minimal, but seems better
 		 * to keep them up-to-date to avoid confusion.
 		 */
 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
 		{
 			xl_parameter_change xlrec;
 			XLogRecPtr	recptr;
 
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
 			recptr = XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE);
 			XLogFlush(recptr);
 		}
 
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
 
 /*
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
  * Note: this function assumes there is no other process running
  * concurrently that could update it.
@@ -9227,20 +9253,21 @@ xlog_redo(XLogReaderState *record)
 		 */
 		if (standbyState >= STANDBY_INITIALIZED)
 		{
 			TransactionId *xids;
 			int			nxids;
 			TransactionId oldestActiveXID;
 			TransactionId latestCompletedXid;
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
 			 * down server, with only prepared transactions still alive. We're
 			 * never overflowed at this point because all subxids are listed
 			 * with their parent prepared transactions.
 			 */
 			running.xcnt = nxids;
 			running.subxcnt = 0;
 			running.subxid_overflow = false;
@@ -9416,20 +9443,21 @@ xlog_redo(XLogReaderState *record)
 		/* Update our copy of the parameters in pg_control */
 		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
 
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
 		 * recover back up to this point before allowing hot standby again.
 		 * This is particularly important if wal_level was set to 'archive'
 		 * before, and is now 'hot_standby', to ensure you don't run queries
 		 * against the WAL preceding the wal_level change. Same applies to
 		 * decreasing max_* settings.
 		 */
 		minRecoveryPoint = ControlFile->minRecoveryPoint;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 95d6c14..3100f50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -11,20 +11,21 @@
  *	  src/backend/bootstrap/bootstrap.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <unistd.h>
 #include <signal.h>
 
 #include "access/htup_details.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index e82a53a..b72364c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -234,20 +234,29 @@ CREATE VIEW pg_available_extension_versions AS
            LEFT JOIN pg_extension AS X
              ON E.name = X.extname AND E.version = X.extversion;
 
 CREATE VIEW pg_prepared_xacts AS
     SELECT P.transaction, P.gid, P.prepared,
            U.rolname AS owner, D.datname AS database
     FROM pg_prepared_xact() AS P
          LEFT JOIN pg_authid U ON P.ownerid = U.oid
          LEFT JOIN pg_database D ON P.dbid = D.oid;
 
+CREATE VIEW pg_fdw_xacts AS
+	SELECT P.transaction, D.datname AS database, S.srvname AS "foreign server",
+			U.rolname AS "local user", P.status,
+			P.identifier AS "foreign transaction identifier"
+	FROM pg_fdw_xact() AS P
+		LEFT JOIN pg_authid U ON P.userid = U.oid
+		LEFT JOIN pg_database D ON P.dbid = D.oid
+		LEFT JOIN pg_foreign_server S ON P.serverid = S.oid;
+
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
 	CASE WHEN rel.relkind = 'r' THEN 'table'::text
 		 WHEN rel.relkind = 'v' THEN 'view'::text
 		 WHEN rel.relkind = 'm' THEN 'materialized view'::text
 		 WHEN rel.relkind = 'S' THEN 'sequence'::text
@@ -923,10 +932,18 @@ LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'make_interval';
 
 CREATE OR REPLACE FUNCTION
   jsonb_set(jsonb_in jsonb, path text[] , replacement jsonb,
             create_if_missing boolean DEFAULT true)
 RETURNS jsonb
 LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'jsonb_set';
+
+CREATE OR REPLACE FUNCTION
+  pg_fdw_remove(transaction xid DEFAULT NULL, dbid oid DEFAULT NULL,
+				serverid oid DEFAULT NULL, userid oid DEFAULT NULL)
+RETURNS void
+LANGUAGE INTERNAL
+VOLATILE
+AS 'pg_fdw_remove';
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index cc912b2..3408252 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -6,20 +6,21 @@
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  *
  *
  * IDENTIFICATION
  *	  src/backend/commands/foreigncmds.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
@@ -1080,20 +1081,34 @@ RemoveForeignServerById(Oid srvId)
 	HeapTuple	tp;
 	Relation	rel;
 
 	rel = heap_open(ForeignServerRelationId, RowExclusiveLock);
 
 	tp = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(srvId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one and the server gets dropped, we will have no way to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
 
 	heap_close(rel, RowExclusiveLock);
 }
 
 
 /*
  * Common routine to check permission for user-mapping-related DDL
@@ -1252,20 +1267,21 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
 
 	umId = GetSysCacheOid2(USERMAPPINGUSERSERVER,
 						   ObjectIdGetDatum(useId),
 						   ObjectIdGetDatum(srv->serverid));
 	if (!OidIsValid(umId))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("user mapping \"%s\" does not exist for the server",
 						MappingUserName(useId))));
 
+
 	user_mapping_ddl_aclcheck(useId, srv->serverid, stmt->servername);
 
 	tp = SearchSysCacheCopy1(USERMAPPINGOID, ObjectIdGetDatum(umId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for user mapping %u", umId);
 
 	memset(repl_val, 0, sizeof(repl_val));
 	memset(repl_null, false, sizeof(repl_null));
 	memset(repl_repl, false, sizeof(repl_repl));
@@ -1378,20 +1394,31 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 		/* IF EXISTS specified, just note it */
 		ereport(NOTICE,
 		(errmsg("user mapping \"%s\" does not exist for the server, skipping",
 				MappingUserName(useId))));
 		return InvalidOid;
 	}
 
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
 	object.objectId = umId;
 	object.objectSubId = 0;
 
 	performDeletion(&object, DROP_CASCADE, 0);
 
 	return umId;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1bb3138..01594c1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -86,20 +86,21 @@
 #ifdef USE_BONJOUR
 #include <dns_sd.h>
 #endif
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
 #include <pthread.h>
 #endif
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "lib/ilist.h"
 #include "libpq/auth.h"
 #include "libpq/ip.h"
 #include "libpq/libpq.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
@@ -2447,21 +2448,20 @@ pmdie(SIGNAL_ARGS)
 				SignalUnconnectedWorkers(SIGTERM);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
 				/* and the bgwriter too */
 				if (BgWriterPID != 0)
 					signal_child(BgWriterPID, SIGTERM);
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
-
 				/*
 				 * If we're in recovery, we can't kill the startup process
 				 * right away, because at present doing so does not release
 				 * its locks.  We might want to change this in a future
 				 * release.  For the time being, the PM_WAIT_READONLY state
 				 * indicates that we're waiting for the regular (read only)
 				 * backends to die off; once they do, we'll kill the startup
 				 * and walreceiver processes.
 				 */
 				pmState = (pmState == PM_RUN) ?
@@ -5689,20 +5689,21 @@ PostmasterMarkPIDForWorkerNotify(int pid)
 
 	dlist_foreach(iter, &BackendList)
 	{
 		bp = dlist_container(Backend, elem, iter.cur);
 		if (bp->pid == pid)
 		{
 			bp->bgworker_notify = true;
 			return true;
 		}
 	}
+
 	return false;
 }
 
 #ifdef EXEC_BACKEND
 
 /*
  * The following need to be available to the save/restore_backend_variables
  * functions.  They are marked NON_EXEC_STATIC in their home modules.
  */
 extern slock_t *ShmemLock;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c629da3..6fdd818 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -127,20 +127,21 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_MULTIXACT_ID:
 		case RM_RELMAP_ID:
 		case RM_BTREE_ID:
 		case RM_HASH_ID:
 		case RM_GIN_ID:
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
 	}
 }
 
 /*
  * Handle rmgr XLOG_ID records for DecodeRecordIntoReorderBuffer().
  */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a790e5b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -14,20 +14,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/fdw_xact.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
@@ -132,20 +133,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcSignalShmemSize());
 		size = add_size(size, CheckpointerShmemSize());
 		size = add_size(size, AutoVacuumShmemSize());
 		size = add_size(size, ReplicationSlotsShmemSize());
 		size = add_size(size, ReplicationOriginShmemSize());
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
 
 		/* freeze the addin request size and include it */
 		addin_request_allowed = false;
 		size = add_size(size, total_addin_request);
 
 		/* might as well round it off to a multiple of a typical page size */
 		size = add_size(size, 8192 - (size % 8192));
@@ -243,20 +245,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
 	/*
 	 * Alloc the win32 shared backend array
 	 */
 	if (!IsUnderPostmaster)
 		ShmemBackendArrayAllocation();
 #endif
 
diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c
index 6b61765..d6cba87 100644
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@@ -15,21 +15,20 @@
 #include "postgres.h"
 
 #include <limits.h>
 
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/xact.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 
-#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 #define PG_RETURN_TRANSACTIONID(x)	return TransactionIdGetDatum(x)
 
 #define PG_GETARG_COMMANDID(n)		DatumGetCommandId(PG_GETARG_DATUM(n))
 #define PG_RETURN_COMMANDID(x)		return CommandIdGetDatum(x)
 
 
 Datum
 xidin(PG_FUNCTION_ARGS)
 {
 	char	   *str = PG_GETARG_CSTRING(0);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1bed525..9695032 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -20,20 +20,21 @@
 #include <float.h>
 #include <math.h>
 #include <limits.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/vacuum.h"
 #include "commands/variable.h"
 #include "commands/trigger.h"
@@ -1611,20 +1612,31 @@ static struct config_bool ConfigureNamesBool[] =
 		{"data_checksums", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows whether data checksums are turned on for this cluster."),
 			NULL,
 			GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
 		},
 		&data_checksums,
 		false,
 		NULL, NULL, NULL
 	},
 
+	{
+		{"atomic_foreign_transaction", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Transactions involving foreign servers are atomic."),
+			NULL,
+			GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
+		},
+		&atomic_foreign_xact,
+		false,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, false, NULL, NULL, NULL
 	}
 };
 
 
 static struct config_int ConfigureNamesInt[] =
 {
 	{
@@ -1999,20 +2011,33 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"max_prepared_transactions", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the maximum number of simultaneously prepared transactions."),
 			NULL
 		},
 		&max_prepared_xacts,
 		0, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("Sets the minimum OID of tables for tracking locks."),
 			gettext_noop("Is used to avoid output on system tables."),
 			GUC_NOT_IN_SAMPLE
 		},
 		&Trace_lock_oidmin,
 		FirstNormalObjectId, 0, INT_MAX,
 		NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 06dfc06..6d812cc 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -117,20 +117,26 @@
 					# (change requires restart)
 #huge_pages = try			# on, off, or try
 					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
 # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
 # per transaction slot, plus lock space (see max_locks_per_transaction).
 # It is not advisable to set max_prepared_transactions nonzero unless you
 # actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0		# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
 					# supported by the operating system:
 					#   posix
 					#   sysv
 					#   windows
 					#   mmap
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index feeff9e..47ecf1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,31 +192,32 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
 	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
 };
 
 
 /* path to 'initdb' binary directory */
 static char bin_path[MAXPGPATH];
 static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			  const char *token, const char *replacement);
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d8cfe5e..00aad71 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -324,12 +324,14 @@ main(int argc, char *argv[])
 	printf(_("Size of a large-object chunk:         %u\n"),
 		   ControlFile.loblksize);
 	printf(_("Date/time type storage:               %s\n"),
 		   (ControlFile.enableIntTimes ? _("64-bit integers") : _("floating-point numbers")));
 	printf(_("Float4 argument passing:              %s\n"),
 		   (ControlFile.float4ByVal ? _("by value") : _("by reference")));
 	printf(_("Float8 argument passing:              %s\n"),
 		   (ControlFile.float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile.data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile.max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index e7e8059..880e895 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -581,20 +581,21 @@ GuessControlValues(void)
 	ControlFile.unloggedLSN = 1;
 
 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
 	ControlFile.relseg_size = RELSEG_SIZE;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
 	ControlFile.indexMaxKeys = INDEX_MAX_KEYS;
@@ -797,20 +798,21 @@ RewriteControlFile(void)
 	 * Force the defaults for max_* settings. The values don't really matter
 	 * as long as wal_level='minimal'; the postmaster will reset these fields
 	 * anyway at startup.
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
 	ControlFile.xlog_seg_size = XLogSegSize;
 
 	/* Contents are protected with a CRC */
 	INIT_CRC32C(ControlFile.crc);
 	COMP_CRC32C(ControlFile.crc,
 				(char *) &ControlFile,
 				offsetof(ControlFileData, crc));
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 2205d6e..b9f3d84 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -14,20 +14,21 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/fdw_xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
 	{ name, desc, identify},
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..8b55e48
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,74 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "foreign/fdwapi.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serveroid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serveroid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+extern bool	atomic_foreign_xact;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serveroid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of the foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c083216..7272c33 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -37,11 +37,12 @@ PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify,
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
 PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index cb1c2db..d614ab6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -296,20 +296,21 @@ typedef struct xl_xact_parsed_abort
 	RelFileNode *xnodes;
 
 	TransactionId twophase_xid; /* only for 2PC */
 } xl_xact_parsed_abort;
 
 
 /* ----------------
  *		extern definitions
  * ----------------
  */
+#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 extern bool IsTransactionState(void);
 extern bool IsAbortedTransactionBlockState(void);
 extern TransactionId GetTopTransactionId(void);
 extern TransactionId GetTopTransactionIdIfAny(void);
 extern TransactionId GetCurrentTransactionId(void);
 extern TransactionId GetCurrentTransactionIdIfAny(void);
 extern TransactionId GetStableLatestTransactionId(void);
 extern SubTransactionId GetCurrentSubTransactionId(void);
 extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 5ebaa5f..c4d80e6 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -206,20 +206,21 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 
 /*
  * Information logged when we detect a change in one of the parameters
  * important for Hot Standby.
  */
 typedef struct xl_parameter_change
 {
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
 	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
 typedef struct xl_restore_point
 {
 	TimestampTz rp_time;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..d168c32 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -173,20 +173,21 @@ typedef struct ControlFileData
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
 	 * or hot standby.
 	 */
 	int			wal_level;
 	bool		wal_log_hints;
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
 	 * explicitly, since the pg_control version will surely look wrong to a
 	 * machine of different endianness, but we do need to worry about MAXALIGN
 	 * and floating-point format.  (Note: storage layout nominally also
 	 * depends on SHORTALIGN and INTALIGN, but in practice these are the same
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6fd1278..e4569d5 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5252,20 +5252,26 @@ DESCR("fractional rank of hypothetical row");
 DATA(insert OID = 3989 ( percent_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_percent_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3990 ( cume_dist			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 701 "2276" "{2276}" "{v}" _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("cumulative distribution of hypothetical row");
 DATA(insert OID = 3991 ( cume_dist_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_cume_dist_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 20 "2276" "{2276}" "{v}" _null_ _null_ _null_	aggregate_dummy _null_ _null_ _null_ ));
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_	hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4066 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4083 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4099 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3584 ( binary_upgrade_set_next_array_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_array_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3585 ( binary_upgrade_set_next_toast_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_toast_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3586 ( binary_upgrade_set_next_heap_pg_class_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_heap_pg_class_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..d1ddb4e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -5,20 +5,21 @@
  *
  * Copyright (c) 2010-2015, PostgreSQL Global Development Group
  *
  * src/include/foreign/fdwapi.h
  *
  *-------------------------------------------------------------------------
  */
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/xact.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
 /* To avoid including explain.h here, reference ExplainState thus: */
 struct ExplainState;
 
 
 /*
  * Callback function signatures --- see fdwhandler.sgml for more info.
  */
@@ -110,20 +111,32 @@ typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 											   HeapTuple *rows, int targrows,
 												  double *totalrows,
 												  double *totaldeadrows);
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 												 AcquireSampleRowsFunc *func,
 													BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
+typedef bool (*EndForeignTransaction_function) (Oid serverOid, Oid userid,
+													bool is_commit);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverOid, Oid userid,
+														int prep_info_len,
+														char *prep_info);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverOid, Oid userid,
+															bool is_commit,
+														int prep_info_len,
+														char *prep_info);
+typedef char *(*GetPrepareId_function) (Oid serverOid, Oid userid,
+														int *prep_info_len);
+
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
  * planner and executor.
  *
  * More function pointers are likely to be added in the future.  Therefore
  * it's recommended that the handler initialize the struct with
  * makeNode(FdwRoutine) so that all fields are set to NULL.  This will
  * ensure that no fields are accidentally left undefined.
@@ -165,20 +178,26 @@ typedef struct FdwRoutine
 
 	/* Support functions for EXPLAIN */
 	ExplainForeignScan_function ExplainForeignScan;
 	ExplainForeignModify_function ExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	AnalyzeForeignTable_function AnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
+
+	/* Support functions for foreign transactions */
+	GetPrepareId_function				GetPrepareId;
+	EndForeignTransaction_function		EndForeignTransaction;
+	PrepareForeignTransaction_function	PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
 } FdwRoutine;
 
 
 /* Functions in foreign/foreign.c */
 extern FdwRoutine *GetFdwRoutine(Oid fdwhandler);
 extern Oid	GetForeignServerIdByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineByServerId(Oid serverid);
 extern FdwRoutine *GetFdwRoutineByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineForRelation(Relation relation, bool makecopy);
 extern bool IsImportableForeignTable(const char *tablename,
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cff3b99..d03b119 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,22 +128,23 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define OldSerXidLock				(&MainLWLockArray[31].lock)
 #define SyncRepLock					(&MainLWLockArray[32].lock)
 #define BackgroundWorkerLock		(&MainLWLockArray[33].lock)
 #define DynamicSharedMemoryControlLock		(&MainLWLockArray[34].lock)
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
 #define ReplicationOriginLock		(&MainLWLockArray[40].lock)
+#define FDWXactLock					(&MainLWLockArray[41].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		41
+#define NUM_INDIVIDUAL_LWLOCKS		42
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
  * here, but we need them to figure out offsets within MainLWLockArray, and
  * having this file include lock.h or bufmgr.h would be backwards.
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index e807a2e..283917c 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -209,25 +209,26 @@ typedef struct PROC_HDR
 } PROC_HDR;
 
 extern PROC_HDR *ProcGlobal;
 
 extern PGPROC *PreparedXactProcs;
 
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
 extern int	DeadlockTimeout;
 extern int	StatementTimeout;
 extern int	LockTimeout;
 extern bool log_lock_waits;
 
 
 /*
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fcb0bf0..3bddb46 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1245,11 +1245,15 @@ extern Datum pg_available_extensions(PG_FUNCTION_ARGS);
 extern Datum pg_available_extension_versions(PG_FUNCTION_ARGS);
 extern Datum pg_extension_update_paths(PG_FUNCTION_ARGS);
 extern Datum pg_extension_config_dump(PG_FUNCTION_ARGS);
 
 /* commands/prepare.c */
 extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index cd53375..3728ed2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1305,20 +1305,30 @@ pg_available_extensions| SELECT e.name,
     e.comment
    FROM (pg_available_extensions() e(name, default_version, comment)
      LEFT JOIN pg_extension x ON ((e.name = x.extname)));
 pg_cursors| SELECT c.name,
     c.statement,
     c.is_holdable,
     c.is_binary,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT p.transaction,
+    d.datname AS database,
+    s.srvname AS "foreign server",
+    u.rolname AS "local user",
+    p.status,
+    p.identifier AS "foreign transaction identifier"
+   FROM (((pg_fdw_xact() p(dbid, transaction, serverid, userid, status, identifier)
+     LEFT JOIN pg_authid u ON ((p.userid = u.oid)))
+     LEFT JOIN pg_database d ON ((p.dbid = d.oid)))
+     LEFT JOIN pg_foreign_server s ON ((p.serverid = s.oid)));
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
     a.name,
     a.setting,
     a.applied,
     a.error
    FROM pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting, applied, error);
 pg_group| SELECT pg_authid.rolname AS groname,
     pg_authid.oid AS grosysid,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index ed8c369..0f21f05 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2224,37 +2224,40 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		if (system(buf))
 		{
 			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, temp_instance, buf);
 			exit(2);
 		}
 
 		/*
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
 		if (pg_conf == NULL)
 		{
 			fprintf(stderr, _("\n%s: could not open \"%s\" for adding extra config: %s\n"), progname, buf, strerror(errno));
 			exit(2);
 		}
 		fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
 		fputs("log_autovacuum_min_duration = 0\n", pg_conf);
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		if (temp_config != NULL)
 		{
 			FILE	   *extra_conf;
 			char		line_buf[1024];
 
 			extra_conf = fopen(temp_config, "r");
 			if (extra_conf == NULL)
 			{
 				fprintf(stderr, _("\n%s: could not open \"%s\" to read extra config: %s\n"), progname, temp_config, strerror(errno));

#32

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Ashutosh Bapat (#31)

Re: Transactions involving multiple postgres foreign servers

On Wed, Jul 29, 2015 at 6:58 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

A user may set atomic_foreign_transaction to ON to guarantee atomicity, IOW
it throws error when atomicity can not be guaranteed. Thus if application
accidentally does something to a foreign server, which doesn't support 2PC,
the transaction would abort. A user may set it to OFF (consciously and takes
the responsibility of the result) so as not to use 2PC (probably to reduce
the overheads) even if the foreign server is 2PC compliant. So, I thought a
GUC would be necessary. We can incorporate the behaviour you are suggesting
by having atomic_foreign_transaction accept three values "full" (ON
behaviour), "partial" (behaviour you are describing), "none" (OFF
behaviour). Default value of this GUC would be "partial". Will that be fine?

I don't really see the point. If the user attempts a distributed
transaction involving FDWs that can't support atomic foreign
transactions, then I think it's reasonable to assume that they want
that to work rather than arbitrarily fail. The only situation in
which it's desirable for that to fail is when the user doesn't realize
that the FDW in question doesn't support atomic foreign commit, and
the error message warns them that their assumptions are unfounded.
But can't the user find that out easily enough by reading the
documentation? So I think that in practice the "full" value of this
GUC would get almost zero use; I think that nearly everyone will be
happy with what you are here calling "partial" or "none". I'll defer
to any other consensus that emerges, but that's my view.

I think that we should not change the default behavior. Currently,
the only behavior is not to attempt 2PC. Let's stick with that.

About table level atomic commit attribute, I agree that some foreign tables
might hold "more critical" data than others from the same server, but I am
not sure whether only that attribute should dictate the atomicity or not. A
transaction collectively might need to be "atomic" even if the individual
tables it modified are not set atomic_commit attribute. So, we need a
transaction level attribute for atomicity, which may be overridden by a
table level attribute. Should we add support to the table level atomicity
setting as version 2+?

I'm not hung up on the table-level attribute, but I think having a
server-level attribute rather than a global GUC is a good idea.
However, I welcome other thoughts on that.

We should consider other possible designs as well; the choices we make
here may have a significant impact on usability.

I looked at other RDBMSes like IBM's federated database or Oracle. They
support only "full" behaviour as described above with some optimizations
like LRO. But, I would like to hear about other options.

Yes, I hope others will weigh in.

HandleForeignTransaction is not very descriptive, and I think you're
jamming together things that ought to be separated. Let's have a
PrepareForeignTransaction and a ResolvePreparedForeignTransaction.

Done, there are three hooks now
1. For preparing a foreign transaction
2. For resolving a prepared foreign transaction
3. For committing/aborting a running foreign transaction (more explanation
later)

(2) and (3) seem like the same thing. I don't see any further
explanation later in your email; what am I missing?

IP might be fine, but consider altering dbname option or dropping it; we
won't find the prepared foreign transaction in new database.

Probably not, but I think that's the DBA's problem, not ours.

I think we
should at least warn the user that there exists a prepared foreign
transaction on the given foreign server or user mapping; better still if we
let the FDW decide which options are allowed to be altered when there exists
a foreign prepared transaction. The latter requires some surgery in the way
we handle the options.

We can consider that, but I don't think it's an essential part of the
patch, and I'd punt it for now in the interest of keeping this as
simple as possible.

Is this a change from the current behavior?

There is no current behaviour defined per se.

My point is that you had some language in the email describing what
happens if the GUC is turned off. You shouldn't have to describe
that, because there should be absolutely zero difference. If there
isn't, that's a problem for this patch, and probably a subject for a
different one.

I have added a built-in pg_fdw_remove() (or any suitable name), which
removes the prepared foreign transaction entry from memory and disk. The
function needs to be called before attempting PITR. If the recovery points
to a past time and the file has not been removed, we abort the recovery. In
such a case, a DBA can remove the foreign prepared transaction file manually
before recovery. I have added a hint to that effect in the error message. Is
that enough?

That seems totally broken. Before PITR, the database might be
inconsistent, in which case you can't call any functions at all.
Also, you shouldn't be trying to resolve any transactions until the
end of recovery, because you don't know when you see that the
transaction was prepared whether, at some subsequent time, you will
see it resolved. You need to finish recovery and, only after entering
normal running, decide whether to resolve any transactions that are
still sitting around. There should be no situation (short of e.g. OS
errors writing the state files) where this stuff makes recovery fail.
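
The rule described above (wait until recovery has finished, then decide per transaction) can be sketched as a small decision function. This is only an illustration under stated assumptions: the enum and function names are hypothetical and do not come from the patch, and the three local states are a simplification of what the transaction manager actually tracks.

```c
#include <assert.h>

/* Outcome of the local transaction, as known once recovery has finished. */
typedef enum LocalXactStatus
{
	LOCAL_XACT_COMMITTED,
	LOCAL_XACT_ABORTED,
	LOCAL_XACT_IN_PROGRESS		/* e.g. still prepared locally, undecided */
} LocalXactStatus;

/* What to do with the matching prepared transaction on the foreign server. */
typedef enum ResolveAction
{
	RESOLVE_COMMIT_PREPARED,
	RESOLVE_ROLLBACK_PREPARED,
	RESOLVE_WAIT				/* leave it alone and look again later */
} ResolveAction;

/*
 * Hypothetical helper: map the local outcome to the action for the
 * foreign prepared transaction.  It is meant to be called only in normal
 * running, never while WAL is still being replayed.
 */
static ResolveAction
decide_resolution(LocalXactStatus status)
{
	switch (status)
	{
		case LOCAL_XACT_COMMITTED:
			return RESOLVE_COMMIT_PREPARED;
		case LOCAL_XACT_ABORTED:
			return RESOLVE_ROLLBACK_PREPARED;
		case LOCAL_XACT_IN_PROGRESS:
			return RESOLVE_WAIT;
	}
	/* Unknown status: the safest choice in this sketch is to do nothing. */
	return RESOLVE_WAIT;
}
```

Keeping the decision separate from recovery means a prepared entry seen during replay is merely remembered; nothing is sent to the foreign server until the final local outcome is known.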

I noticed that the functions pg_fdw_resolve() and pg_fdw_remove(), which
resolve or remove unresolved prepared foreign transactions resp., effect
changes which cannot be rolled back if the transaction which ran these
functions rolls back. These need to be converted into SQL commands like
ROLLBACK PREPARED, which can't be run within a transaction.

Yeah, maybe. I'm not sure using a functional interface is all that
bad, but we could think about changing it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#33

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 10 years ago

In reply to: Robert Haas (#32)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 30, 2015 at 1:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jul 29, 2015 at 6:58 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

A user may set atomic_foreign_transaction to ON to guarantee atomicity, IOW
it throws error when atomicity can not be guaranteed. Thus if application
accidentally does something to a foreign server, which doesn't support 2PC,
the transaction would abort. A user may set it to OFF (consciously and takes
the responsibility of the result) so as not to use 2PC (probably to reduce
the overheads) even if the foreign server is 2PC compliant. So, I thought a
GUC would be necessary. We can incorporate the behaviour you are suggesting
by having atomic_foreign_transaction accept three values "full" (ON
behaviour), "partial" (behaviour you are describing), "none" (OFF
behaviour). Default value of this GUC would be "partial". Will that be fine?

I don't really see the point. If the user attempts a distributed
transaction involving FDWs that can't support atomic foreign
transactions, then I think it's reasonable to assume that they want
that to work rather than arbitrarily fail. The only situation in
which it's desirable for that to fail is when the user doesn't realize
that the FDW in question doesn't support atomic foreign commit, and
the error message warns them that their assumptions are unfounded.
But can't the user find that out easily enough by reading the
documentation? So I think that in practice the "full" value of this
GUC would get almost zero use; I think that nearly everyone will be
happy with what you are here calling "partial" or "none". I'll defer
to any other consensus that emerges, but that's my view.

I think that we should not change the default behavior. Currently,
the only behavior is not to attempt 2PC. Let's stick with that.

Ok. I will remove the GUC and have "partial atomic" behaviour as you
suggested in previous mail.

About table level atomic commit attribute, I agree that some foreign tables
might hold "more critical" data than others from the same server, but I am
not sure whether only that attribute should dictate the atomicity or not. A
transaction collectively might need to be "atomic" even if the individual
tables it modified are not set atomic_commit attribute. So, we need a
transaction level attribute for atomicity, which may be overridden by a
table level attribute. Should we add support to the table level atomicity
setting as version 2+?

I'm not hung up on the table-level attribute, but I think having a
server-level attribute rather than a global GUC is a good idea.
However, I welcome other thoughts on that.

The patch supports server level attribute. Let me repeat the relevant
description from my earlier mail
--
Every FDW needs to register the connection while starting new transaction
on a foreign connection (RegisterXactForeignServer()). A foreign server
connection is identified by foreign server oid and the local user oid
(similar to the entry cached by postgres_fdw). *While registering, FDW also
tells whether the foreign server is capable of participating in two-phase
commit protocol.* How to decide that is left entirely to the FDW. An FDW
like file_fdw may not have 2PC support at all, so all its foreign servers
do not comply with 2PC. An FDW might have all its servers 2PC compliant. An
FDW like postgres_fdw can have some of its servers compliant and some not,
depending upon server version, configuration (max_prepared_transactions =
0) etc.
--
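
To illustrate the registration step described above, here is a minimal sketch of a per-transaction list of foreign connections keyed by (server oid, user oid), with the 2PC capability flag supplied by the FDW. The struct and function names, the fixed-size array and the duplicate check are assumptions made for this example; they are not the patch's RegisterXactForeignServer() implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

#define MAX_PARTICIPANTS 8		/* arbitrary limit for this sketch */

/* One foreign connection taking part in the current local transaction. */
typedef struct Participant
{
	Oid			serverid;			/* foreign server oid */
	Oid			userid;				/* local user oid */
	bool		two_phase_capable;	/* as reported by the FDW */
} Participant;

typedef struct ParticipantList
{
	Participant items[MAX_PARTICIPANTS];
	size_t		count;
} ParticipantList;

/*
 * Hypothetical registration helper.  A connection is identified by
 * (serverid, userid); registering the same pair again is a no-op.
 * Returns false only when the list is full.
 */
static bool
register_participant(ParticipantList *list, Oid serverid, Oid userid,
					 bool two_phase_capable)
{
	size_t		i;

	assert(list != NULL);
	for (i = 0; i < list->count; i++)
	{
		if (list->items[i].serverid == serverid &&
			list->items[i].userid == userid)
			return true;		/* already registered */
	}
	if (list->count >= MAX_PARTICIPANTS)
		return false;
	list->items[list->count].serverid = serverid;
	list->items[list->count].userid = userid;
	list->items[list->count].two_phase_capable = two_phase_capable;
	list->count++;
	return true;
}

/* Count the registered connections that can take part in 2PC. */
static size_t
count_two_phase_capable(const ParticipantList *list)
{
	size_t		i;
	size_t		n = 0;

	assert(list != NULL);
	for (i = 0; i < list->count; i++)
	{
		if (list->items[i].two_phase_capable)
			n++;
	}
	return n;
}
```

With such a list, the commit path can see at a glance how many participants there are and how many of them could be prepared, which is the information the server-level behaviour depends on.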

Does that look good?

We should consider other possible designs as well; the choices we make
here may have a significant impact on usability.

I looked at other RDBMSes like IBM's federated database or Oracle. They
support only "full" behaviour as described above with some optimizations
like LRO. But, I would like to hear about other options.

Yes, I hope others will weigh in.

HandleForeignTransaction is not very descriptive, and I think you're
jamming together things that ought to be separated. Let's have a
PrepareForeignTransaction and a ResolvePreparedForeignTransaction.

Done, there are three hooks now
1. For preparing a foreign transaction
2. For resolving a prepared foreign transaction
3. For committing/aborting a running foreign transaction (more explanation
later)

(2) and (3) seem like the same thing. I don't see any further
explanation later in your email; what am I missing?

In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz'
(with the prepared transaction id filled in) and 3 always fires COMMIT/ABORT
TRANSACTION (notice the absence of PREPARED and 'xyz'). We might want to
combine them into a single hook, but there are slight differences
depending upon the FDW. For postgres_fdw, 2 should get a connection which
does not have a running transaction, whereas for 3 there has to be a
running transaction on that connection. Hook 2 should get the prepared
foreign transaction identifier as one of its arguments; hook 3 will not have
that argument. Hook 2 will be relevant for the two-phase commit protocol,
whereas 3 will be used for connections not using two-phase commit.

The differences are much more visible in the code.
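
As a rough illustration of the difference, the following sketch builds the SQL text each hook would send in the postgres_fdw case. The helper names are hypothetical and not taken from the patch, and the identifier is interpolated without escaping, which real code would have to handle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hook 2 (hypothetical helper): resolve an already-prepared transaction.
 * The prepared transaction identifier is required.
 * NOTE: prep_id is not escaped here; real code must quote it safely.
 */
static int
build_resolve_cmd(char *buf, size_t buflen, bool is_commit,
				  const char *prep_id)
{
	assert(buf != NULL && prep_id != NULL);
	assert(strlen(prep_id) > 0);
	return snprintf(buf, buflen, "%s PREPARED '%s'",
					is_commit ? "COMMIT" : "ROLLBACK", prep_id);
}

/*
 * Hook 3 (hypothetical helper): end a transaction that is still running
 * on the connection.  There is no identifier argument.
 */
static int
build_end_cmd(char *buf, size_t buflen, bool is_commit)
{
	assert(buf != NULL);
	return snprintf(buf, buflen, "%s TRANSACTION",
					is_commit ? "COMMIT" : "ABORT");
}
```

The two helpers differ in exactly the ways listed above: only the first takes an identifier, and only the second assumes a transaction is still open on the connection.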

IP might be fine, but consider altering dbname option or dropping it; we
won't find the prepared foreign transaction in new database.

Probably not, but I think that's the DBA's problem, not ours.

Fine.

I think we
should at least warn the user that there exists a prepared foreign
transaction on the given foreign server or user mapping; better still if we
let the FDW decide which options are allowed to be altered when there exists
a foreign prepared transaction. The latter requires some surgery in the way
we handle the options.

We can consider that, but I don't think it's an essential part of the
patch, and I'd punt it for now in the interest of keeping this as
simple as possible.

Fine.

Is this a change from the current behavior?

There is no current behaviour defined per se.

My point is that you had some language in the email describing what
happens if the GUC is turned off. You shouldn't have to describe
that, because there should be absolutely zero difference. If there
isn't, that's a problem for this patch, and probably a subject for a
different one.

Ok got it.

I have added a built-in pg_fdw_remove() (or any suitable name), which
removes the prepared foreign transaction entry from memory and disk. The
function needs to be called before attempting PITR. If the recovery points
to a past time and the file has not been removed, we abort the recovery. In
such a case, a DBA can remove the foreign prepared transaction file manually
before recovery. I have added a hint to that effect in the error message. Is
that enough?

That seems totally broken. Before PITR, the database might be
inconsistent, in which case you can't call any functions at all.
Also, you shouldn't be trying to resolve any transactions until the
end of recovery, because you don't know when you see that the
transaction was prepared whether, at some subsequent time, you will
see it resolved. You need to finish recovery and, only after entering
normal running, decide whether to resolve any transactions that are
still sitting around.

That's how it works in the patch for unresolved prepared foreign
transactions belonging to xids within the known range. For those belonging
to xids in the future (beyond the known range of xids after PITR), we cannot
determine the status of the local transaction (as those do not appear in
the xlog) and hence cannot decide the fate of the prepared foreign
transaction. You seem to be suggesting that we should let the recovery
finish and mark those prepared foreign transactions as "can not be resolved"
or something like that. A DBA can remove those entries once s/he has dealt
with them on the foreign server.

There's a small problem with that approach. The triplet (xid, serverid,
userid) is used to identify a foreign prepared transaction entry in memory
and to create a unique file name for storing it on disk. If we allow
a future xid after PITR, it might conflict with the xid of a transaction
that takes place after PITR. That will cause a problem if exactly the same
foreign server and user participate in the transaction with the conflicting
xid (rare but possible).
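
The collision can be seen with a small sketch that derives a file name from the triplet. The fixed-width hex format used here is an assumption for illustration only; the thread does not spell out the exact naming scheme, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef unsigned int TransactionId;	/* stand-in for PostgreSQL's type */
typedef unsigned int Oid;			/* stand-in for PostgreSQL's type */

/*
 * Hypothetical naming scheme: one fixed-width hex field per member of the
 * (xid, serverid, userid) triplet.  Returns snprintf()'s result.
 */
static int
fdw_xact_file_name(char *buf, size_t buflen, TransactionId xid,
				   Oid serverid, Oid userid)
{
	assert(buf != NULL);
	return snprintf(buf, buflen, "%08X_%08X_%08X", xid, serverid, userid);
}
```

Because the name depends only on the triplet, an entry left over from the abandoned timeline and a new transaction that later reuses the same xid on the same server and user mapping would map to the same file.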

The other problem is that the foreign server on which the transaction was
prepared (or the user whose mapping was used to prepare the transaction)
might have been added at a point later than the PITR target, in which case
we cannot even know which foreign server this transaction was prepared on.

There should be no situation (short of e.g. OS
errors writing the state files) where this stuff makes recovery fail.

During PITR, if we encounter a prepared (local) transaction with a future
xid, we just forget that prepared transaction (instead of failing
recovery). Maybe we should do the same for unresolved foreign prepared
transactions as well (at least for version 1): forget the unresolved
prepared foreign transactions which belong to a future xid. Anyway, as per
the timeline after PITR, those never existed.
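
A minimal sketch of that "forget future xids" rule follows. The entry struct and function names are hypothetical, and the xid comparison is a simplified version of the modulo-2^32 comparison PostgreSQL uses for normal xids (special permanent xids are ignored here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Hypothetical in-memory entry for an unresolved foreign transaction. */
typedef struct FdwXactEntry
{
	TransactionId xid;			/* local transaction that prepared it */
	uint32_t	serverid;
	uint32_t	userid;
} FdwXactEntry;

/*
 * Simplified wraparound-aware comparison: true if a is at or after b.
 * Relies on the usual two's-complement conversion to int32_t.
 */
static bool
xid_follows_or_equals(TransactionId a, TransactionId b)
{
	return (int32_t) (a - b) >= 0;
}

/*
 * Drop every entry whose xid is at or beyond next_xid, the first xid not
 * yet assigned on the timeline chosen by PITR.  Compacts the array in
 * place and returns the number of entries kept.
 */
static size_t
forget_future_entries(FdwXactEntry *entries, size_t count,
					  TransactionId next_xid)
{
	size_t		kept = 0;
	size_t		i;

	assert(entries != NULL || count == 0);
	for (i = 0; i < count; i++)
	{
		if (!xid_follows_or_equals(entries[i].xid, next_xid))
			entries[kept++] = entries[i];
	}
	return kept;
}
```

Note that this only cleans up the local bookkeeping; the transactions stay prepared on the foreign servers and would still need attention there.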

Other DBMSes solve this problem by using markers. Markers are allowed to be
set at times when there are no unresolved foreign transactions, and PITR is
allowed up to those markers and not to any arbitrary point in time. But this
looks out of scope for this patch.

I noticed that the functions pg_fdw_resolve() and pg_fdw_remove(), which
resolve or remove unresolved prepared foreign transactions resp., effect
changes which cannot be rolled back if the transaction which ran these
functions rolls back. These need to be converted into SQL commands like
ROLLBACK PREPARED, which can't be run within a transaction.

Yeah, maybe. I'm not sure using a functional interface is all that
bad, but we could think about changing it.

Fine.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#34

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Ashutosh Bapat (#33)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jul 31, 2015 at 6:33 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I'm not hung up on the table-level attribute, but I think having a
server-level attribute rather than a global GUC is a good idea.
However, I welcome other thoughts on that.

The patch supports server level attribute. Let me repeat the relevant
description from my earlier mail
--
Every FDW needs to register the connection while starting new transaction on
a foreign connection (RegisterXactForeignServer()). A foreign server
connection is identified by foreign server oid and the local user oid
(similar to the entry cached by postgres_fdw). While registering, FDW also
tells whether the foreign server is capable of participating in two-phase
commit protocol. How to decide that is left entirely to the FDW. An FDW like
file_fdw may not have 2PC support at all, so all its foreign servers do not
comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW
like postgres_fdw can have some of its servers compliant and some not,
depending upon server version, configuration (max_prepared_transactions = 0)
etc.
--

Does that look good?

OK, sure. But let's make sure postgres_fdw gets a server-level option
to control this.

Done, there are three hooks now
1. For preparing a foreign transaction
2. For resolving a prepared foreign transaction
3. For committing/aborting a running foreign transaction (more explanation
later)

(2) and (3) seem like the same thing. I don't see any further
explanation later in your email; what am I missing?

In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (fill
the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION
(notice absence of PREPARED and 'xyz').

Oh, OK. But then isn't #3 something we already have? i.e. pgfdw_xact_callback?

That seems totally broken. Before PITR, the database might be
inconsistent, in which case you can't call any functions at all.
Also, you shouldn't be trying to resolve any transactions until the
end of recovery, because you don't know when you see that the
transaction was prepared whether, at some subsequent time, you will
see it resolved. You need to finish recovery and, only after entering
normal running, decide whether to resolve any transactions that are
still sitting around.

That's how it works in the patch for unresolved prepared foreign
transactions belonging to xids within the known range. For those belonging
to xids in the future (beyond the known range of xids after PITR), we can not
determine the status of that local transaction (as those do not appear in
the xlog) and hence can not decide the fate of the prepared foreign
transaction. You seem to be suggesting that we should let the recovery
finish and mark those prepared foreign transactions as "can not be resolved"
or something like that. A DBA can remove those entries once s/he has dealt
with them on the foreign server.

There's a small problem with that approach. The triplet (xid, serverid,
userid) is used to identify a foreign prepared transaction entry in memory
and is used to create a unique file name for storing it on disk. If we allow
a future xid after PITR, it might conflict with the xid of a transaction that
might take place after PITR. It will cause a problem if exactly the same
foreign server and user participate in the transaction with the conflicting
xid (rare but possible).

Other problem is that the foreign server on which the transaction was
prepared (or the user whose mapping was used to prepare the transaction),
might have got added in a future time wrt PITR, in which case, we can not
even know which foreign server this transaction was prepared on.

There should be no situation (short of e.g. OS
errors writing the state files) where this stuff makes recovery fail.

During PITR, if we encounter a prepared (local) transaction with a future
xid, we just forget that prepared transaction (instead of failing recovery).
May be we should do the same for unresolved foreign prepared transaction as
well (at least for version 1); forget the unresolved prepared foreign
transactions which belong to a future xid. Anyway, as per the timeline after
PITR those never existed.

This last sentence seems to me to be exactly on point. Note the
comment in twophase.c:

* We throw away any prepared xacts with main XID beyond nextXid --- if any
* are present, it suggests that the DBA has done a PITR recovery to an
* earlier point in time without cleaning out pg_twophase. We dare not
* try to recover such prepared xacts since they likely depend on database
* state that doesn't exist now.

In other words, normally there should never be any XIDs "from the
future" with prepared transactions; but in certain PITR scenarios it
might be possible. We might as well be consistent with what the
existing 2PC code does in this case - i.e. just warn and then remove
the files.
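
To make the "warn and then remove" behaviour concrete, here is a minimal
standalone sketch, not code from the patch. The FdwXactEntry struct and
discard_future_entries() are hypothetical names, Xid stands in for
TransactionId, and the comparison only imitates the modulo-2^32 logic of
TransactionIdFollowsOrEquals() without its special-xid handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t Xid;			/* stand-in for TransactionId */

/* Hypothetical record of one prepared foreign transaction. */
typedef struct FdwXactEntry
{
	Xid			local_xid;		/* local transaction that prepared it */
	uint32_t	serverid;		/* foreign server OID */
	uint32_t	userid;			/* local user OID */
} FdwXactEntry;

/* Wraparound-aware "a >= b"; ignores special xids, unlike the real thing. */
static int
xid_follows_or_equals(Xid a, Xid b)
{
	return (int32_t) (a - b) >= 0;
}

/*
 * Drop every entry whose local xid is at or beyond next_xid, warning about
 * each one. Compacts the array in place and returns the number kept.
 */
static size_t
discard_future_entries(FdwXactEntry *entries, size_t n, Xid next_xid)
{
	size_t		kept = 0;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (xid_follows_or_equals(entries[i].local_xid, next_xid))
		{
			fprintf(stderr,
					"WARNING: removing future foreign prepared transaction "
					"(xid %u, server %u, user %u)\n",
					(unsigned) entries[i].local_xid,
					(unsigned) entries[i].serverid,
					(unsigned) entries[i].userid);
			continue;
		}
		entries[kept++] = entries[i];
	}
	return kept;
}
```

In the real code the same pass would also have to unlink the on-disk state
file of each discarded entry; that part is left out here.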

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#35

Amit Kapila

amit.kapila16@gmail.com

over 10 years ago

In reply to: Ashutosh Bapat (#26)

Re: Transactions involving multiple postgres foreign servers

On Tue, Feb 17, 2015 at 2:56 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

2. New catalog - This method takes out the need to have separate method
for C1, C5 and even C2, also the synchronization will be taken care of by
row locks, there will be no limit on the number of foreign transactions as
well as the size of foreign prepared transaction information. But big
problem with this approach is that, the changes to the catalogs are atomic
with the local transaction. If a foreign prepared transaction can not be
aborted while the local transaction is rolled back, that entry needs to
retained. But since the local transaction is aborting the corresponding
catalog entry would become invisible and thus unavailable to the resolver
(alas! we do not have autonomous transaction support). We may be able to
overcome this, by simulating autonomous transaction through a background
worker (which can also act as a resolver). But the amount of communication
and synchronization, might affect the performance.

For rollback, why can't we do it the other way around: first roll back the
transaction on the foreign servers and then roll back the local transaction?

I think for commit it is essential that we first commit on the local
server, so that we can resolve the status of prepared transactions on the
foreign servers after crash recovery. For the abort case, however, even if
we don't roll back on the local server first, the outcome can be deduced
during crash recovery (any transaction which is not committed should be
rolled back) for the purpose of resolving the prepared transactions.
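
The rule above (committed locally means commit remotely, anything not
committed means roll back) could be sketched as follows. This is only an
illustration under stated assumptions: LocalXactStatus and
build_resolve_command() are hypothetical names, and the prepared transaction
id is assumed to contain no single quotes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical view of the local transaction's fate as seen by a resolver. */
typedef enum LocalXactStatus
{
	LOCAL_XACT_COMMITTED,		/* commit record found */
	LOCAL_XACT_NOT_COMMITTED,	/* aborted, or crashed before committing */
	LOCAL_XACT_IN_PROGRESS		/* still running; too early to decide */
} LocalXactStatus;

/*
 * Build the command to send to the foreign server for one prepared
 * transaction. Returns false if nothing should be sent yet, or if the
 * command does not fit into buf.
 */
static bool
build_resolve_command(LocalXactStatus status, const char *gid,
					  char *buf, size_t buflen)
{
	const char *verb;
	int			len;

	assert(gid != NULL && buf != NULL);
	assert(strchr(gid, '\'') == NULL);	/* assumption: no quoting needed */

	switch (status)
	{
		case LOCAL_XACT_COMMITTED:
			verb = "COMMIT";
			break;
		case LOCAL_XACT_NOT_COMMITTED:
			verb = "ROLLBACK";
			break;
		default:
			return false;		/* in progress: leave it alone for now */
	}

	len = snprintf(buf, buflen, "%s PREPARED '%s'", verb, gid);
	return len >= 0 && (size_t) len < buflen;
}
```

Because "not committed" covers both an explicit abort and a crash before
commit, the resolver needs no separate record of local rollbacks, which is
the point being made for the abort case.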

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#36

Amit Kapila

amit.kapila16@gmail.com

over 10 years ago

In reply to: Ashutosh Bapat (#29)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 9, 2015 at 3:48 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

2. New catalog - This method takes out the need to have separate method
for C1, C5 and even C2, also the synchronization will be taken care of by
row locks, there will be no limit on the number of foreign transactions as
well as the size of foreign prepared transaction information. But big
problem with this approach is that, the changes to the catalogs are atomic
with the local transaction. If a foreign prepared transaction can not be
aborted while the local transaction is rolled back, that entry needs to
retained. But since the local transaction is aborting the corresponding
catalog entry would become invisible and thus unavailable to the resolver
(alas! we do not have autonomous transaction support). We may be able to
overcome this, by simulating autonomous transaction through a background
worker (which can also act as a resolver). But the amount of communication
and synchronization, might affect the performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid,
ignoring MVCC. Not sure how well that would work.

I am not aware how to do that. Do we have any precedent in the code,
something like a reference implementation which I can follow?

Would something along the lines of COPY FREEZE help here?

However if you are going to follow this method, then I think you
need to also ensure when and how to clear those rows after
rollback is complete or once resolver has resolved those prepared
foreign transactions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#37

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 10 years ago

In reply to: Robert Haas (#34)

Re: Transactions involving multiple postgres foreign servers

On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 31, 2015 at 6:33 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I'm not hung up on the table-level attribute, but I think having a
server-level attribute rather than a global GUC is a good idea.
However, I welcome other thoughts on that.

The patch supports server level attribute. Let me repeat the relevant
description from my earlier mail
--
Every FDW needs to register the connection while starting new transaction on
a foreign connection (RegisterXactForeignServer()). A foreign server
connection is identified by foreign server oid and the local user oid
(similar to the entry cached by postgres_fdw). While registering, FDW also
tells whether the foreign server is capable of participating in two-phase
commit protocol. How to decide that is left entirely to the FDW. An FDW like
file_fdw may not have 2PC support at all, so all its foreign servers do not
comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW
like postgres_fdw can have some of its servers compliant and some not,
depending upon server version, configuration (max_prepared_transactions = 0)
etc.
--

Does that look good?

OK, sure. But let's make sure postgres_fdw gets a server-level option
to control this.

For postgres_fdw it's a boolean server-level option 'twophase_compliant'
(suggestions for name welcome).

Done, there are three hooks now
1. For preparing a foreign transaction
2. For resolving a prepared foreign transaction
3. For committing/aborting a running foreign transaction (more explanation
later)

(2) and (3) seem like the same thing. I don't see any further
explanation later in your email; what am I missing?

In case of postgres_fdw, 2 always fires COMMIT/ROLLBACK PREPARED 'xyz' (fill
the prepared transaction id) and 3 always fires COMMIT/ABORT TRANSACTION
(notice absence of PREPARED and 'xyz').

Oh, OK. But then isn't #3 something we already have? i.e. pgfdw_xact_callback?

While transactions are being prepared on the foreign connections, if any
prepare fails, we have to abort transactions on the rest of the connections
(and abort the prepared transactions). pgfdw_xact_callback wouldn't know,
which connections have prepared transactions and which do not have. So,
even in case of two-phase commit we need all the three hooks. Since we have
to define these three hooks, we might as well centralize all the
transaction processing and let the foreign transaction manager decide which
of the hooks to invoke. So, the patch moves most of the code in
pgfdw_xact_callback in the relevant hook and foreign transaction manager
invokes appropriate hook. Only thing that remains in pgfdw_xact_callback
now is end of transaction handling like resetting cursor numbering.
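
A rough standalone sketch of why all three hooks are needed once one prepare
fails. None of these names come from the patch: FdwXactHooks, ForeignConn,
prepare_all() and the toy_* hooks are hypothetical, and the toy hooks only
record the command that would be sent instead of talking to a server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>				/* strcmp() for callers checking last_cmd */

typedef struct ForeignConn ForeignConn;

/* Hypothetical per-FDW hooks, one per case described above. */
typedef struct FdwXactHooks
{
	bool		(*prepare) (ForeignConn *conn);			/* hook 1 */
	void		(*resolve_prepared) (ForeignConn *conn, bool commit);	/* hook 2 */
	void		(*end_running) (ForeignConn *conn, bool commit);		/* hook 3 */
} FdwXactHooks;

struct ForeignConn
{
	const FdwXactHooks *hooks;
	bool		prepared;		/* tracked by the manager, not by the FDW */
	void	   *private_data;
};

/* Abort everything: prepared connections need hook 2, the rest hook 3. */
static void
abort_all(ForeignConn *conns, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (conns[i].prepared)
		{
			conns[i].hooks->resolve_prepared(&conns[i], false);
			conns[i].prepared = false;
		}
		else
			conns[i].hooks->end_running(&conns[i], false);
	}
}

/* Phase one: prepare on every connection; on the first failure abort all. */
static bool
prepare_all(ForeignConn *conns, size_t n)
{
	assert(conns != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (!conns[i].hooks->prepare(&conns[i]))
		{
			abort_all(conns, n);
			return false;
		}
		conns[i].prepared = true;
	}
	return true;
}

/* Toy hooks for illustration: remember the last command instead of sending. */
typedef struct ToyState
{
	bool		fail_prepare;
	const char *last_cmd;
} ToyState;

static bool
toy_prepare(ForeignConn *conn)
{
	ToyState   *s = conn->private_data;

	if (s->fail_prepare)
		return false;
	s->last_cmd = "PREPARE TRANSACTION";
	return true;
}

static void
toy_resolve(ForeignConn *conn, bool commit)
{
	ToyState   *s = conn->private_data;

	s->last_cmd = commit ? "COMMIT PREPARED" : "ROLLBACK PREPARED";
}

static void
toy_end(ForeignConn *conn, bool commit)
{
	ToyState   *s = conn->private_data;

	s->last_cmd = commit ? "COMMIT TRANSACTION" : "ABORT TRANSACTION";
}

static const FdwXactHooks toy_hooks = {toy_prepare, toy_resolve, toy_end};
```

The prepared flag lives in the manager's own bookkeeping, which is the
information pgfdw_xact_callback alone would not have. In this sketch the
connection whose prepare failed is also sent an abort through hook 3.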

That seems totally broken. Before PITR, the database might be
inconsistent, in which case you can't call any functions at all.
Also, you shouldn't be trying to resolve any transactions until the
end of recovery, because you don't know when you see that the
transaction was prepared whether, at some subsequent time, you will
see it resolved. You need to finish recovery and, only after entering
normal running, decide whether to resolve any transactions that are
still sitting around.

That's how it works in the patch for unresolved prepared foreign
transactions belonging to xids within the known range. For those belonging
to xids in the future (beyond the known range of xids after PITR), we can not
determine the status of that local transaction (as those do not appear in
the xlog) and hence can not decide the fate of the prepared foreign
transaction. You seem to be suggesting that we should let the recovery
finish and mark those prepared foreign transactions as "can not be resolved"
or something like that. A DBA can remove those entries once s/he has dealt
with them on the foreign server.

There's a small problem with that approach. The triplet (xid, serverid,
userid) is used to identify a foreign prepared transaction entry in memory
and is used to create a unique file name for storing it on disk. If we allow
a future xid after PITR, it might conflict with the xid of a transaction that
might take place after PITR. It will cause a problem if exactly the same
foreign server and user participate in the transaction with the conflicting
xid (rare but possible).

Other problem is that the foreign server on which the transaction was
prepared (or the user whose mapping was used to prepare the transaction),
might have got added in a future time wrt PITR, in which case, we can not
even know which foreign server this transaction was prepared on.

There should be no situation (short of e.g. OS
errors writing the state files) where this stuff makes recovery fail.

During PITR, if we encounter a prepared (local) transaction with a future
xid, we just forget that prepared transaction (instead of failing recovery).
May be we should do the same for unresolved foreign prepared transaction as
well (at least for version 1); forget the unresolved prepared foreign
transactions which belong to a future xid. Anyway, as per the timeline after
PITR those never existed.

This last sentence seems to me to be exactly on point. Note the
comment in twophase.c:

* We throw away any prepared xacts with main XID beyond nextXid --- if any
* are present, it suggests that the DBA has done a PITR recovery to an
* earlier point in time without cleaning out pg_twophase. We dare not
* try to recover such prepared xacts since they likely depend on database
* state that doesn't exist now.

In other words, normally there should never be any XIDs "from the
future" with prepared transactions; but in certain PITR scenarios it
might be possible. We might as well be consistent with what the
existing 2PC code does in this case - i.e. just warn and then remove
the files.

Ok. Done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#38

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 10 years ago

In reply to: Ashutosh Bapat (#37)

Re: Transactions involving multiple postgres foreign servers

On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:

On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, sure. But let's make sure postgres_fdw gets a server-level option
to control this.

For postgres_fdw it's a boolean server-level option 'twophase_compliant'
(suggestions for name welcome).

How about just 'twophase'?

Thanks,
Amit


#39

Robert Haas

robertmhaas@gmail.com

over 10 years ago

In reply to: Amit Langote (#38)

Re: Transactions involving multiple postgres foreign servers

On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:

On Sat, Aug 1, 2015 at 12:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:

OK, sure. But let's make sure postgres_fdw gets a server-level option
to control this.

For postgres_fdw it's a boolean server-level option 'twophase_compliant'
(suggestions for name welcome).

How about just 'twophase'?

How about two_phase_commit?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#40

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 10 years ago

In reply to: Robert Haas (#39)

Re: Transactions involving multiple postgres foreign servers

On 2015-08-05 AM 06:11, Robert Haas wrote:

On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:

For postgres_fdw it's a boolean server-level option 'twophase_compliant'
(suggestions for name welcome).

How about just 'twophase'?

How about two_phase_commit?

Much cleaner, +1

Thanks,
Amit


#41

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 10 years ago

In reply to: Amit Langote (#40)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Wed, Aug 5, 2015 at 6:20 AM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:

On 2015-08-05 AM 06:11, Robert Haas wrote:

On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:

For postgres_fdw it's a boolean server-level option 'twophase_compliant'
(suggestions for name welcome).

How about just 'twophase'?

How about two_phase_commit?

Much cleaner, +1

I was more inclined to use an adjective, since it's a property of server,
instead of a noun. But two_phase_commit looks fine as well, included in the
patch attached.

Attached patch addresses all the concerns and suggestions from previous
mails in this mail thread.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact.patch (binary/octet-stream)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver demon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..6f587ae
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,364 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How frequently the resolver demon checks for unresolved transactions? */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to let the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to let the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver demon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/* 
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL; 
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+			
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid); 
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/* 
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+	
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about the run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done exit now */
+	proc_exit(0);
+}
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..341db6f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -8,20 +8,22 @@
  * IDENTIFICATION
  *		  contrib/postgres_fdw/connection.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 
 
 /*
  * Connection cache hash table entry
  *
  * The lookup key in this hash table is the foreign server OID plus the user
@@ -57,52 +59,59 @@ typedef struct ConnCacheEntry
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+									bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
  * server with the user's authorization.  A new connection is established
  * if we don't already have a suitable one, and a transaction is opened at
  * the right subtransaction nesting depth if we didn't do that already.
  *
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * start_transaction must be true to open a remote (sub)transaction.  If
+ * connection_error_ok is true, a failed connect returns NULL, not an error.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
  * syscaches.  For the moment, though, it's not clear that this would really
  * be useful and not mere pedantry.  We could not flush any active connections
  * mid-transaction anyway.
  */
 PGconn *
 GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt)
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
 		HASHCTL		ctl;
 
@@ -116,23 +125,20 @@ GetConnection(ForeignServer *server, UserMapping *user,
 									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
 		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key.serverid = server->serverid;
 	key.userid = user->userid;
 
 	/*
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
@@ -152,41 +158,64 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * connect_pg_server() returns NULL only when the caller passed
+		 * connection_error_ok; otherwise it has already raised an error.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
+
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, server);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
+
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
 	return entry->conn;
 }
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the connection attempt fails, return NULL when the caller can handle the
+ * failure (connection_error_ok is true); otherwise throw an error.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+					bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
 	/*
 	 * Use PG_TRY block to ensure closing connection on error.
 	 */
 	PG_TRY();
 	{
 		const char **keywords;
 		const char **values;
@@ -227,25 +256,29 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 		{
 			char	   *connmessage;
 			int			msglen;
 
 			/* libpq typically appends a newline, strip that */
 			connmessage = pstrdup(PQerrorMessage(conn));
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+			/* XXX: must not return from inside PG_TRY; conn is also leaked */
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						 errmsg("could not connect to server \"%s\"",
+								server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
 		 * otherwise, he's piggybacking on the postgres server's user
 		 * identity. See also dblink_security_check() in contrib/dblink.
 		 */
 		if (!superuser() && !PQconnectionUsedPassword(conn))
 			ereport(ERROR,
 				  (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
@@ -362,29 +395,36 @@ do_sql_command(PGconn *conn, const char *sql)
  * Start remote transaction or subtransaction, if needed.
  *
  * Note that we always use at least REPEATABLE READ in the remote session.
  * This is so that, if a query initiates multiple scans of the same or
  * different foreign tables, we will get snapshot-consistent results from
  * those scans.  A disadvantage is that we can't provide sane emulation of
  * READ COMMITTED behavior --- it would be nice if we had some other way to
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the foreign server with the transaction manager, recording
+		 * whether it is configured for two-phase commit.
+		 */
+		RegisterXactForeignServer(entry->key.serverid, entry->key.userid,
+									server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
 		if (IsolationIsSerializable())
 			sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
 		else
 			sql = "START TRANSACTION ISOLATION LEVEL REPEATABLE READ";
 		do_sql_command(entry->conn, sql);
 		entry->xact_depth = 1;
 	}
@@ -506,148 +546,295 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * Craft a prepared transaction identifier.  The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It must be a string literal less than 200 bytes long.
+ * 2. It must not match any other currently prepared transaction's identifier.
+ *
+ * Ideally we would use something like a UUID, which is unique with high
+ * probability, but generating one may be expensive here and the extension
+ * that provides UUID generation functions is not part of core.  The result
+ * is palloc'd; its length is returned in *prep_info_len.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%u_%u",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The reported length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* The connection cache should already hold the connection we want */
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+
+		PGconn	*conn = entry->conn;
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* The connection cache should already hold the connection we want */
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
+
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key.serverid = serverid;
+		key.userid = userid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+	if (!conn && IsTransactionState())
+	{
+		ForeignServer	*foreign_server = GetForeignServer(serverid);
+		UserMapping		*user_mapping = GetUserMapping(userid, serverid);
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+		conn = GetConnection(foreign_server, user_mapping, false, false, true);
+	}
 
-			switch (event)
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed; raise a warning to log the reason for the
+			 * failure.  We may not be in a transaction here, so raising an
+			 * error doesn't help.  Even if we are in a transaction, it would be
+			 * the resolver transaction, which would be aborted by the error,
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ROLLBACK a prepared transaction and it was
+			 * missing on the foreign server, it was probably resolved by some
+			 * other means, so consider it resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.  We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: the next two statements should move to the end-of-transaction
+	 * callback.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
 	/* Also reset cursor numbering for next transaction */
 	cursor_number = 0;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
@@ -708,10 +895,33 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 			if (PQresultStatus(res) != PGRES_COMMAND_OK)
 				pgfdw_report_error(WARNING, res, entry->conn, true, sql);
 			else
 				PQclear(res);
 		}
 
 		/* OK, we're outta that level of subtransaction */
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Look for the two_phase_commit server option */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
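
For readers following the resolution logic in postgresResolvePreparedForeignTransaction() above, here is a small standalone sketch of how the five-character SQLSTATE from the remote server decides whether a prepared foreign transaction counts as resolved. It is not part of the patch. The SKETCH_* macros are local stand-ins modelled on PostgreSQL's PGSIXBIT/MAKE_SQLSTATE packing, and the sketch_* function names are hypothetical. The length check on the SQLSTATE string is an extra precaution that the patch itself does not have.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Local stand-ins modelled on PostgreSQL's SQLSTATE packing macros. */
#define SKETCH_SIXBIT(ch) (((ch) - '0') & 0x3F)
#define SKETCH_MAKE_SQLSTATE(c1, c2, c3, c4, c5) \
	(SKETCH_SIXBIT(c1) + (SKETCH_SIXBIT(c2) << 6) + \
	 (SKETCH_SIXBIT(c3) << 12) + (SKETCH_SIXBIT(c4) << 18) + \
	 (SKETCH_SIXBIT(c5) << 24))

/* 42704 = undefined_object, 08006 = connection_failure */
#define SKETCH_UNDEFINED_OBJECT   SKETCH_MAKE_SQLSTATE('4', '2', '7', '0', '4')
#define SKETCH_CONNECTION_FAILURE SKETCH_MAKE_SQLSTATE('0', '8', '0', '0', '6')

/*
 * Pack the SQLSTATE text reported by the remote server.  A missing or
 * malformed value is treated as a connection failure (the length check is
 * an assumption added for this sketch).
 */
int
sketch_pack_sqlstate(const char *diag)
{
	if (diag == NULL || strlen(diag) != 5)
		return SKETCH_CONNECTION_FAILURE;
	return SKETCH_MAKE_SQLSTATE(diag[0], diag[1], diag[2], diag[3], diag[4]);
}

/*
 * A COMMIT/ROLLBACK PREPARED counts as resolved if it succeeded, or if the
 * remote server reports that the prepared transaction no longer exists.
 */
bool
sketch_is_resolved(bool command_ok, const char *diag)
{
	if (command_ok)
		return true;
	return sketch_pack_sqlstate(diag) == SKETCH_UNDEFINED_OBJECT;
}
```

With this rule, a resolver that retries after a crash does not get stuck on a transaction that a DBA already cleaned up by hand, while any other failure (including a lost connection) leaves the entry to be retried later.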
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1f417b3..118c42b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3634,10 +3634,348 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors to occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_with_fdw           | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_with_fdw           | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server in
+-- turn rolling back the whole transaction.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     2
+(1 row)
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+ pg_fdw_remove 
+---------------
+ 
+(1 row)
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     0
+(1 row)
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+ database | foreign server | local user | status 
+----------+----------------+------------+--------
+(0 rows)
+
+SELECT database, gid FROM pg_prepared_xacts;
+ database | gid 
+----------+-----
+(0 rows)
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_fdw_subxact        | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_fdw_subxact        | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- Those servers capable of two phase commit, will commit their transactions
+-- atomically with the local transaction. The transactions on the incapable
+-- servers will be committed independent of the outcome of the other foreign
+-- transactions.
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+ALTER SERVER loopback2 OPTIONS (SET two_phase_commit 'false'); 
+-- Changes to the local server and the loopback1 will be rolled back as prepare
+-- on loopback1 would fail because of constraint violation. But the changes on
+-- loopback2, which doesn't execute two phase commit, will be committed.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (2);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (1);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   2
+(2 rows)
+
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+-- Changes to all the servers, local and foreign, will be rolled back as those
+-- on loopback2 (incapable of two-phase commit) could not be commited.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (1);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+(3 rows)
+
+-- At the end, we should not have any foreign transaction remaining unresolved
+SELECT * FROM pg_fdw_xacts;
+ transaction | database | foreign server | local user | status | foreign transaction identifier 
+-------------+----------+----------------+------------+--------+--------------------------------
+(0 rows)
+
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..b70bbd3 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -98,21 +98,22 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname),
 					 errhint("Valid options in this context are: %s",
 							 buf.data)));
 		}
 
 		/*
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
 		}
 		else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 				 strcmp(def->defname, "fdw_tuple_cost") == 0)
 		{
 			/* these must have a non-negative numeric value */
 			double		val;
 			char	   *endp;
@@ -146,20 +147,22 @@ InitPgFdwOptions(void)
 		{"column_name", AttributeRelationId, false},
 		/* use_remote_estimate is available on both server and table */
 		{"use_remote_estimate", ForeignServerRelationId, false},
 		{"use_remote_estimate", ForeignTableRelationId, false},
 		/* cost factors */
 		{"fdw_startup_cost", ForeignServerRelationId, false},
 		{"fdw_tuple_cost", ForeignServerRelationId, false},
 		/* updatable is available on both server and table */
 		{"updatable", ForeignServerRelationId, false},
 		{"updatable", ForeignTableRelationId, false},
+		/* 2PC compatibility */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
 	/* Prevent redundant initialization. */
 	if (postgres_fdw_options)
 		return;
 
 	/*
 	 * Get list of valid libpq options.
 	 *
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e4d799c..f574543 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -9,20 +9,22 @@
  *		  contrib/postgres_fdw/postgres_fdw.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -362,20 +364,26 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for EXPLAIN */
 	routine->ExplainForeignScan = postgresExplainForeignScan;
 	routine->ExplainForeignModify = postgresExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	routine->AnalyzeForeignTable = postgresAnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	PG_RETURN_POINTER(routine);
 }
 
 /*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
  * We should consider the effect of all baserestrictinfo clauses here, but
  * not any join clauses.
  */
@@ -918,21 +926,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	/* Get info about foreign table. */
 	fsstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(fsstate->rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(server, user, false);
+	fsstate->conn = GetConnection(server, user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
 
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
 	fsstate->retrieved_attrs = (List *) list_nth(fsplan->fdw_private,
 											   FdwScanPrivateRetrievedAttrs);
@@ -1316,21 +1324,21 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	 */
 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
 
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(server, user, true);
+	fmstate->conn = GetConnection(server, user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
 	fmstate->query = strVal(list_nth(fdw_private,
 									 FdwModifyPrivateUpdateSql));
 	fmstate->target_attrs = (List *) list_nth(fdw_private,
 											  FdwModifyPrivateTargetAttnums);
 	fmstate->has_returning = intVal(list_nth(fdw_private,
 											 FdwModifyPrivateHasReturning));
 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
@@ -1766,21 +1774,21 @@ estimate_path_cost_size(PlannerInfo *root,
 		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
 						 &retrieved_attrs);
 		if (fpinfo->remote_conds)
 			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
 							  true, NULL);
 		if (remote_join_conds)
 			appendWhereClause(&sql, root, baserel, remote_join_conds,
 							  (fpinfo->remote_conds == NIL), NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		conn = GetConnection(fpinfo->server, fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
 
 		retrieved_rows = rows;
 
 		/* Factor in the selectivity of the locally-checked quals */
 		local_sel = clauselist_selectivity(root,
 										   local_join_conds,
 										   baserel->relid,
@@ -2330,21 +2338,21 @@ postgresAnalyzeForeignTable(Relation relation,
 	 * it's probably not worth redefining that API at this point.
 	 */
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
 	 */
 	initStringInfo(&sql);
 	deparseAnalyzeSizeSql(&sql, relation);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
 	{
@@ -2422,21 +2430,21 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 											ALLOCSET_SMALL_INITSIZE,
 											ALLOCSET_SMALL_MAXSIZE);
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
 	 */
 	cursor_number = GetCursorNumber(conn);
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
 
 	/* In what follows, do not risk leaking any PGresults. */
@@ -2623,21 +2631,21 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname)));
 	}
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(server, mapping, false);
+	conn = GetConnection(server, mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
 		import_collate = false;
 
 	/* Create workspace for strings */
 	initStringInfo(&buf);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
@@ -2987,10 +2995,11 @@ static void
 conversion_error_callback(void *arg)
 {
 	ConversionLocation *errpos = (ConversionLocation *) arg;
 	TupleDesc	tupdesc = RelationGetDescr(errpos->rel);
 
 	if (errpos->cur_attno > 0 && errpos->cur_attno <= tupdesc->natts)
 		errcontext("column \"%s\" of foreign table \"%s\"",
 				   NameStr(tupdesc->attrs[errpos->cur_attno - 1]->attname),
 				   RelationGetRelationName(errpos->rel));
 }
+
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..8d24359 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -10,30 +10,32 @@
  *
  *-------------------------------------------------------------------------
  */
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
 #include "utils/relcache.h"
+#include "access/fdw_xact.h"
 
 #include "libpq-fe.h"
 
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt);
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
 						 const char **keywords,
 						 const char **values);
@@ -67,12 +69,19 @@ extern void deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 				 List *targetAttrs, List *returningList,
 				 List **retrieved_attrs);
 extern void deparseDeleteSql(StringInfo buf, PlannerInfo *root,
 				 Index rtindex, Relation rel,
 				 List *returningList,
 				 List **retrieved_attrs);
 extern void deparseAnalyzeSizeSql(StringInfo buf, Relation rel);
 extern void deparseAnalyzeSql(StringInfo buf, Relation rel,
 				  List **retrieved_attrs);
 extern void deparseStringLiteral(StringInfo buf, const char *val);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info);
 
 #endif   /* POSTGRES_FDW_H */
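
Editorial sketch (not part of the patch): the header above declares postgresGetPrepareId(), and the documentation changes later in this patch require the returned identifier to be at most 256 bytes and unlikely to collide with other identifiers on the foreign server. A minimal way to build such an identifier might look like the following. The function name, the "px_" prefix, the field order, and the nonce parameter are all assumptions made for illustration; the patch's actual format is described in its own comments and is not reproduced here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Upper bound on identifier length, taken from the documentation text. */
#define SKETCH_MAX_PREP_ID_LEN 256

/*
 * Hypothetical helper: format a prepared-transaction identifier from a
 * caller-supplied nonce, the local transaction id, and the server and user
 * OIDs. Returns the identifier length, or -1 if it does not fit in the
 * buffer or exceeds the documented limit.
 */
static int
sketch_make_prep_id(char *buf, size_t buflen, unsigned long nonce,
					unsigned int local_xid, unsigned int serverid,
					unsigned int userid)
{
	int			len;

	assert(buf != NULL && buflen > 0);

	len = snprintf(buf, buflen, "px_%lu_%u_%u_%u",
				   nonce, local_xid, serverid, userid);

	/* snprintf reports the length it wanted; reject truncated results. */
	if (len < 0 || (size_t) len >= buflen || len > SKETCH_MAX_PREP_ID_LEN)
		return -1;

	return len;
}
```

Embedding the originating transaction id and the server and user OIDs keeps the identifier traceable by a DBA, which is the second goal stated at the start of this thread; the nonce is what lowers the chance of a collision with a concurrent or future transaction.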
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index fcdd92e..e137420 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -827,10 +827,239 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors to occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server in
+-- turn rolling back the whole transaction.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+SELECT database, gid FROM pg_prepared_xacts;
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- Those servers capable of two phase commit, will commit their transactions
+-- atomically with the local transaction. The transactions on the incapable
+-- servers will be committed independent of the outcome of the other foreign
+-- transactions.
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+
+ALTER SERVER loopback2 OPTIONS (SET two_phase_commit 'false'); 
+-- Changes to the local server and the loopback1 will be rolled back as prepare
+-- on loopback1 would fail because of constraint violation. But the changes on
+-- loopback2, which doesn't execute two phase commit, will be committed.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (2);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (1);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+
+-- Changes to all the servers, local and foreign, will be rolled back as those
+-- on loopback2 (incapable of two-phase commit) could not be commited.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (1);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- At the end, we should not have any foreign transaction remaining unresolved
+SELECT * FROM pg_fdw_xacts;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..f918f87 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1411,20 +1411,62 @@ include_dir 'conf.d'
        </para>
 
        <para>
         When running a standby server, you must set this parameter to the
         same or higher value than on the master server. Otherwise, queries
         will not be allowed in the standby server.
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions cannot be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled, a transaction involving one or more foreign
+       servers is guaranteed to commit either all or none of its changes to those servers.
+       The parameter can be set at any time during the session. The value in effect
+       at the time the transaction commits is used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>work_mem</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
         Specifies the amount of memory to be used by internal sort operations
         and hash tables before writing to temporary disk files. The value
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 2361577..9b931e4 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -918,20 +918,85 @@ ImportForeignSchema (ImportForeignSchemaStmt *stmt, Oid serverOid);
      useful to test whether a given foreign-table name will pass the filter.
     </para>
 
     <para>
      If the FDW does not support importing table definitions, the
      <function>ImportForeignSchema</> pointer can be set to <literal>NULL</>.
     </para>
 
    </sect2>
 
+   <sect2 id="fdw-callbacks-transactions">
+    <title>FDW Routines For Transaction Management</title>
+
+    <para>
+<programlisting>
+char *
+GetPrepareInfo (Oid serverOid, Oid userid, int *prep_info_len);
+</programlisting>
+
+     Get the prepared transaction identifier for the given foreign server and
+     user. This function is called when executing <xref linkend="sql-commit">, if
+     <literal>atomic_foreign_transaction</> is enabled. It should return a
+     valid transaction identifier that will be used to prepare the transaction
+     on the foreign server. <parameter>prep_info_len</> should be set
+     to the length of this identifier. The identifier should not be longer
+     than 256 bytes, and it should not conflict with existing
+     identifiers on the foreign server. It should also be unique enough that
+     it is not reused for a future transaction. A transaction may be
+     considered unresolved on <productname>PostgreSQL</> while it has in fact
+     been resolved. In that case the foreign transaction resolver keeps trying
+     to resolve the transaction until it finds out that it has been resolved.
+     If the identifier is the same as that of a future transaction,
+     that future transaction could be resolved
+     erroneously.
+    </para>
+
+    <para>
+     If a foreign server whose Foreign Data Wrapper has a <literal>NULL</>
+     <function>GetPrepareInfo</> pointer participates in a transaction
+     with <literal>atomic_foreign_transaction</> enabled, the transaction
+     is aborted.
+    </para>
+
+    <para>
+<programlisting>
+bool
+HandleForeignTransaction (Oid serverOid, Oid userid, FDWXactAction action,
+                            int prep_id_len, char *prep_id)
+</programlisting>
+
+     Function to end a transaction on the given foreign server with given user.
+     This function is called when executing <xref linkend="sql-commit"> or
+     <xref linkend="sql-rollback">. The function should complete a transaction
+     according to the <parameter>action</> specified. The function should
+     return TRUE on successful completion of transaction and FALSE otherwise.
+     It should not throw an error in case of failure to complete the transaction.
+    </para>
+
+    <para>
+    When <parameter>action</> is FDW_XACT_COMMIT or FDW_XACT_ABORT, the function
+    should commit or roll back the running transaction, respectively. When <parameter>action</>
+    is FDW_XACT_PREPARE, the running transaction should be prepared with the
+    identifier given by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    When <parameter>action</> is FDW_XACT_COMMIT_PREPARED or FDW_XACT_ABORT_PREPARED,
+    the function should respectively commit or roll back the transaction identified
+    by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    </para>
+
+    <para>
+    This function is usually called at the end of the transaction, when
+    access to the database may not be possible. Trying to access catalogs
+    in this function may cause an error to be thrown and can affect other foreign
+    data wrappers.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
     <title>Foreign Data Wrapper Helper Functions</title>
 
     <para>
      Several helper functions are exported from the core server so that
      authors of foreign data wrappers can get easy access to attributes of
      FDW-related objects, such as FDW options.
      To use any of these functions, you need to include the header file
@@ -1308,11 +1373,93 @@ GetForeignServerByName(const char *name, bool missing_ok);
     <para>
      See <filename>src/include/nodes/lockoptions.h</>, the comments
      for <type>RowMarkType</> and <type>PlanRowMark</>
      in <filename>src/include/nodes/plannodes.h</>, and the comments for
      <type>ExecRowMark</> in <filename>src/include/nodes/execnodes.h</> for
      additional information.
     </para>
 
   </sect1>
 
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server, along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to it, when starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</>, using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. In this protocol the commit
+    is processed in two phases: a prepare phase and a commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any foreign server fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed on any of
+    the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase, the distributed transaction manager calls
+    <function>GetPrepareInfo</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_PREPARE.
+    </para>
+    
+    <para>
+    During commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to rollback the prepared transaction. In case the
+    distributed transaction manager fails to commit or rollback a prepared
+    transaction because of connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity can not be
+    guaranteed across foreign servers. If transaction on <productname>PostgreSQL</>
+    is committed, Distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while the same
+    on other foreign servers would be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..8c1afcf 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -1,16 +1,16 @@
 #
 # Makefile for the rmgr descriptor routines
 #
 # src/backend/access/rmgrdesc/Makefile
 #
 
 subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o gindesc.o \
+	   gistdesc.o hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
 	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..0f0c899
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module describes the WAL records for foreign transaction manager. 
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 4f29136..ad07038 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -104,28 +104,29 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 			if (entry->val == xlrec.wal_level)
 			{
 				wal_level_str = entry->name;
 				break;
 			}
 		}
 
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamps=%s",
+						 "track_commit_timestamps=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
 		memcpy(&fpw, rec, sizeof(bool));
 		appendStringInfoString(buf, fpw ? "true" : "false");
 	}
 	else if (info == XLOG_END_OF_RECOVERY)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 94455b2..51b2efd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -8,16 +8,17 @@
 #
 #-------------------------------------------------------------------------
 
 subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o commit_ts.o multixact.o parallel.o rmgr.o slru.o subtrans.o \
 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o \
+	fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
 # ensure that version checks in xlog.c get recompiled when catversion.h changes
 xlog.o: xlog.c $(top_srcdir)/src/include/catalog/catversion.h
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..9f315d9
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2024 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module manages the transactions involving foreign servers. 
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by oid of foreign server and user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers, which can participate
+ * in two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If first phase succeeds, foreign servers are requested to commit respective
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back respective prepared
+ * transactions or abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction leaves
+ * that prepared transaction unresolved. During the first phase, before actually
+ * preparing the transactions, enough information is persisted to the disk and
+ * logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */ 
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreing_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine 		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u with user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(fdw_conn->serverid); 
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreing_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */ 
+	Oid				serveroid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */ 
+	bool			fdw_xact_valid;		/* Has the entry been completed and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serveroid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serveroid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid,
+								Oid userid, int fdw_xact_info_len,
+								char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serveroid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid,
+								bool giveWarning);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */ 
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * The function is responsible for pre-commit processing on foreign connections.
+ * The foreign transactions are prepared on the foreign servers which can
+ * execute two-phase-commit protocol. Those will be aborted or committed after
+ * the current transaction has been aborted or committed resp. We try to commit
+ * transactions on rest of the foreign servers now. For these foreign servers
+ * it is possible that some transactions commit even if the local transaction
+ * aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Prepare the transactions on the foreign servers, which can execute
+	 * two-phase-commit protocol.
+	 */
+	prepare_foreign_transactions();
+}
+
+/*
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/* 
+	 * Loop over the foreign connections 
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+															fdw_conn->userid,
+															&fdw_xact_info_len);
+		
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (thereby relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 * 
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+											fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info);
+		/*
+		 * Between register_fdw_xact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction (say,
+		 * because of a signal). During abort processing, it will send an ABORT
+		 * message to the foreign server. If the foreign server has not prepared
+		 * the transaction, the message will succeed. If the foreign server has
+		 * prepared transaction, it will throw an error, which we will ignore and the
+		 * prepared foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("could not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact; 
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serveroid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid; 
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serveroid = fdw_xact->serveroid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serveroid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */ 
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.  Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.  The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */ 
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact 
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id,
+					FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serveroid == serveroid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serveroid %u, userid %u found",
+						xid, serveroid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serveroid = serveroid;
+	fdw_xact->userid = userid;
+	fdw_xact->fdw_xact_status = fdw_xact_status; 
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */ 
+			fdw_remove_xlog.serveroid = fdw_xact->serveroid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serveroid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of locked foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between those two, we will be left with an
+	 * entry locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/* 
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *    according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *    succeeds, remove them from foreign transaction entries, otherwise unlock
+ *    them.
+ */ 
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns a failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/* 
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serveroid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+				fdw->fdwname);
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool 
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serveroid, fdw_xact->userid,
+								is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+	
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid     | dbid    | serverid | userid  | search for entry with
+ * invalid | invalid | invalid  | invalid | nothing
+ * invalid | invalid | invalid  | valid   | given userid
+ * invalid | invalid | valid    | invalid | given serverid
+ * invalid | invalid | valid    | valid   | given serverid and userid
+ * invalid | valid   | invalid  | invalid | given dbid
+ * invalid | valid   | invalid  | valid   | given dbid and userid
+ * invalid | valid   | valid    | invalid | given dbid and serverid
+ * invalid | valid   | valid    | valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid  | invalid | given xid
+ * valid   | invalid | invalid  | valid   | given xid and userid
+ * valid   | invalid | valid    | invalid | given xid, serverid
+ * valid   | invalid | valid    | valid   | given xid, serverid, userid
+ * valid   | valid   | invalid  | invalid | given xid and dbid 
+ * valid   | valid   | invalid  | valid   | given xid, dbid and userid
+ * valid   | valid   | valid    | invalid | given xid, dbid, serverid
+ * valid   | valid   | valid    | valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria is void (all arguments invalid), every entry matches, so
+ * the function returns true whenever at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked, nothing will be
+ * returned in the list but the returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+		
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serveroid)
+			entry_matches = false;
+		
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+	
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+		
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	  		*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+		
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serveroid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+	
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m",
+							path)));
+		}
+	
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+		
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serveroid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints.
+ * The foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might have
+ * fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serveroids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serveroids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serveroids[nxacts] = fdw_xact->serveroid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serveroids[cnt], userids[cnt]);
+			
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serveroids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serveroids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xact.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+	
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+	
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+	
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+			
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+	
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing more to be done here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The local transaction has neither committed nor aborted, and
+				 * it is not running on any backend. It probably crashed before
+				 * the actual abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+	
+	
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+	
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entry/s without
+ * resolving. The function gives a way to forget about such prepared
+ * transaction in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *    foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									PG_GETARG_TRANSACTIONID(XID_ARGNUM);
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+	
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serveroid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("invalid size of FDW transaction state file \"%s\"",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serveroid != serveroid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("corrupt foreign transaction state file \"%s\"",
+						path)));
+		pfree(buf);
+		return NULL;
+	}
+	
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *    conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+	
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serveroid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serveroid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										serveroid, userid,
+										fdw_xact_file_data->fdw_xact_id_len,
+										fdw_xact_file_data->fdw_xact_id,
+										FDW_XACT_PREPARING);
+			/* Add some valid LSN */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */	
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+	
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
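
For reviewers: a minimal standalone sketch of the file-name convention that PrescanFDWXacts() and ReadFDWXacts() rely on, namely three 8-digit uppercase hex fields (local xid, server OID, user OID) joined by underscores. It is not part of the patch. The function name and the length constant 26 are assumptions made for illustration (the patch's FDW_XACT_FILE_NAME_LEN is defined elsewhere), and plain uint32_t stands in for TransactionId and Oid. Unlike the patch, which filters with strspn() and ignores the sscanf() result, the sketch checks each character position and the conversion count, so a name made only of underscores is rejected.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed length: 8 hex digits + '_' + 8 hex digits + '_' + 8 hex digits. */
#define SKETCH_FILE_NAME_LEN 26

/*
 * Hypothetical helper: split "XXXXXXXX_XXXXXXXX_XXXXXXXX" into its three
 * components. Returns 1 on success, 0 if the name is not in that format.
 */
static int
sketch_parse_fdw_xact_name(const char *name, uint32_t *xid,
                           uint32_t *serveroid, uint32_t *userid)
{
    unsigned int fields[3];
    size_t       i;

    if (strlen(name) != SKETCH_FILE_NAME_LEN)
        return 0;

    /* Positions 8 and 17 must be '_'; all others uppercase hex digits. */
    for (i = 0; i < SKETCH_FILE_NAME_LEN; i++)
    {
        if (i == 8 || i == 17)
        {
            if (name[i] != '_')
                return 0;
        }
        else if (strchr("0123456789ABCDEF", name[i]) == NULL)
            return 0;
    }

    /* All three fields must convert; otherwise treat the name as invalid. */
    if (sscanf(name, "%8x_%8x_%8x", &fields[0], &fields[1], &fields[2]) != 3)
        return 0;

    *xid = fields[0];
    *serveroid = fields[1];
    *userid = fields[2];
    return 1;
}
```

A directory scan would call the helper on each entry name and skip entries for which it returns 0, so the three output values are never used uninitialized.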
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..cdbc583 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -7,20 +7,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/brin_xlog.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 177d1e1..5c9aec7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -34,20 +34,21 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <time.h>
 #include <unistd.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
@@ -1469,20 +1470,26 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 		RelationCacheInitFilePostInvalidate();
 
 	/* And now do the callbacks */
 	if (isCommit)
 		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
 	else
 		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
 	/*
 	 * And now we can clean up our mess.
 	 */
 	RemoveTwoPhaseFile(xid, true);
 
 	RemoveGXact(gxact);
 	MyLockedGxact = NULL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b53d95f..aaa0edc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -14,20 +14,21 @@
  *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
 #include <time.h>
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -179,20 +180,24 @@ typedef struct TransactionStateData
 	TransactionId *childXids;	/* subcommitted child XIDs, in XID order */
 	int			nChildXids;		/* # of subcommitted child XIDs */
 	int			maxChildXids;	/* allocated size of childXids[] */
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
 	bool		startedInRecovery;		/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction;
+										   only valid for a top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
 
 /*
  * CurrentTransactionState always points to the current transaction state
  * block.  It will point to TopTransactionStateData when not in a
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
@@ -1884,20 +1889,23 @@ StartTransaction(void)
 	/* SecurityRestrictionContext should never be set outside a transaction */
 	Assert(s->prevSecContext == 0);
 
 	/*
 	 * initialize other subsystems for new transaction
 	 */
 	AtStart_GUC();
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
 	 */
 	s->state = TRANS_INPROGRESS;
 
 	ShowTransactionState("StartTransaction");
 }
 
 
@@ -1940,20 +1948,23 @@ CommitTransaction(void)
 
 		/*
 		 * Close open portals (converting holdable ones into static portals).
 		 * If there weren't any, we are done ... otherwise loop back to check
 		 * if they queued deferred triggers.  Lather, rinse, repeat.
 		 */
 		if (!PreCommit_Portals(false))
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
 	/*
 	 * The remaining actions cannot call any user-defined code, so it's safe
 	 * to start shutting down within-transaction services.  But note that most
 	 * of this stuff could still throw an error, which would switch us into
 	 * the transaction-abort path.
 	 */
 
@@ -2099,20 +2110,21 @@ CommitTransaction(void)
 	AtEOXact_GUC(true, 1);
 	AtEOXact_SPI(true);
 	AtEOXact_on_commit_actions(true);
 	AtEOXact_Namespace(true, is_parallel_worker);
 	AtEOXact_SMgr();
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
 	ResourceOwnerDelete(TopTransactionResourceOwner);
 	s->curTransactionOwner = NULL;
 	CurTransactionResourceOwner = NULL;
 	TopTransactionResourceOwner = NULL;
 
 	AtCommit_Memory();
 
@@ -2283,20 +2295,21 @@ PrepareTransaction(void)
 	 * before or after releasing the transaction's locks.
 	 */
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
 	AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
 	 *
 	 * We have to record transaction prepares even if we didn't make any
 	 * updates, because the transaction manager might get confused if we lose
 	 * a global transaction.
 	 */
 	EndPrepare(gxact);
 
@@ -2565,20 +2578,21 @@ AbortTransaction(void)
 
 		AtEOXact_GUC(false, 1);
 		AtEOXact_SPI(false);
 		AtEOXact_on_commit_actions(false);
 		AtEOXact_Namespace(false, is_parallel_worker);
 		AtEOXact_SMgr();
 		AtEOXact_Files();
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
 	RESUME_INTERRUPTS();
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1dd31b3..120d897 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -16,20 +16,21 @@
 
 #include <ctype.h>
 #include <time.h>
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -4847,20 +4848,21 @@ BootStrapXLOG(void)
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
 	WriteControlFile();
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5852,20 +5854,23 @@ CheckRequiredParameterValues(void)
 									 ControlFile->max_worker_processes);
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
 		RecoveryRequiresBoolParameter("track_commit_timestamp",
 									  track_commit_timestamp,
 									  ControlFile->track_commit_timestamp);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
 StartupXLOG(void)
 {
 	XLogCtlInsert *Insert;
@@ -6508,21 +6516,24 @@ StartupXLOG(void)
 		{
 			TransactionId *xids;
 			int			nxids;
 
 			ereport(DEBUG1,
 					(errmsg("initializing for hot standby")));
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
 
 			/* Tell procarray about the range of xids it has to deal with */
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
 			 * Startup commit log, commit timestamp and subtrans only.
 			 * MultiXact has already been started up and other SLRUs are not
@@ -7108,20 +7119,21 @@ StartupXLOG(void)
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
 	XLogCtl->LogwrtResult = LogwrtResult;
 
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7305,20 +7317,26 @@ StartupXLOG(void)
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
 	TrimCLOG();
 	TrimMultiXact();
 
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
 	if (readFile >= 0)
 	{
 		close(readFile);
@@ -8579,20 +8597,25 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointMultiXact();
 	CheckPointPredicate();
 	CheckPointRelationMap();
 	CheckPointReplicationSlots();
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
  * Save a checkpoint for recovery restart if appropriate
  *
  * This function is called each time a checkpoint record is read from XLOG.
  * It must determine whether the checkpoint represents a safe restartpoint or
  * not.  If so, the checkpoint record is stashed in shared memory so that
  * CreateRestartPoint can consult it.  (Note that the latter function is
  * executed by the checkpointer, while this one will be executed by the
@@ -9004,56 +9027,59 @@ XLogRestorePoint(const char *rpName)
  */
 static void
 XLogReportParameters(void)
 {
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
 		 * if archiving is not enabled, as you can't start archive recovery
 		 * with wal_level=minimal anyway. We don't really care about the
 		 * values in pg_control either if wal_level=minimal, but seems better
 		 * to keep them up-to-date to avoid confusion.
 		 */
 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
 		{
 			xl_parameter_change xlrec;
 			XLogRecPtr	recptr;
 
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
 			recptr = XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE);
 			XLogFlush(recptr);
 		}
 
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
 
 /*
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
  * Note: this function assumes there is no other process running
  * concurrently that could update it.
@@ -9227,20 +9253,21 @@ xlog_redo(XLogReaderState *record)
 		 */
 		if (standbyState >= STANDBY_INITIALIZED)
 		{
 			TransactionId *xids;
 			int			nxids;
 			TransactionId oldestActiveXID;
 			TransactionId latestCompletedXid;
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
 			 * down server, with only prepared transactions still alive. We're
 			 * never overflowed at this point because all subxids are listed
 			 * with their parent prepared transactions.
 			 */
 			running.xcnt = nxids;
 			running.subxcnt = 0;
 			running.subxid_overflow = false;
@@ -9416,20 +9443,21 @@ xlog_redo(XLogReaderState *record)
 		/* Update our copy of the parameters in pg_control */
 		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
 
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
 		 * recover back up to this point before allowing hot standby again.
 		 * This is particularly important if wal_level was set to 'archive'
 		 * before, and is now 'hot_standby', to ensure you don't run queries
 		 * against the WAL preceding the wal_level change. Same applies to
 		 * decreasing max_* settings.
 		 */
 		minRecoveryPoint = ControlFile->minRecoveryPoint;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 95d6c14..3100f50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -11,20 +11,21 @@
  *	  src/backend/bootstrap/bootstrap.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <unistd.h>
 #include <signal.h>
 
 #include "access/htup_details.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index c0bd6fa..07d0960 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -236,20 +236,29 @@ CREATE VIEW pg_available_extension_versions AS
            LEFT JOIN pg_extension AS X
              ON E.name = X.extname AND E.version = X.extversion;
 
 CREATE VIEW pg_prepared_xacts AS
     SELECT P.transaction, P.gid, P.prepared,
            U.rolname AS owner, D.datname AS database
     FROM pg_prepared_xact() AS P
          LEFT JOIN pg_authid U ON P.ownerid = U.oid
          LEFT JOIN pg_database D ON P.dbid = D.oid;
 
+CREATE VIEW pg_fdw_xacts AS
+	SELECT P.transaction, D.datname AS database, S.srvname AS "foreign server",
+			U.rolname AS "local user", P.status,
+			P.identifier AS "foreign transaction identifier" 
+	FROM pg_fdw_xact() AS P
+		LEFT JOIN pg_authid U ON P.userid = U.oid
+		LEFT JOIN pg_database D ON P.dbid = D.oid
+		LEFT JOIN pg_foreign_server S ON P.serverid = S.oid;
+
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
 	CASE WHEN rel.relkind = 'r' THEN 'table'::text
 		 WHEN rel.relkind = 'v' THEN 'view'::text
 		 WHEN rel.relkind = 'm' THEN 'materialized view'::text
 		 WHEN rel.relkind = 'S' THEN 'sequence'::text
@@ -925,10 +934,18 @@ LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'make_interval';
 
 CREATE OR REPLACE FUNCTION
   jsonb_set(jsonb_in jsonb, path text[] , replacement jsonb,
             create_if_missing boolean DEFAULT true)
 RETURNS jsonb
 LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'jsonb_set';
+
+CREATE OR REPLACE FUNCTION
+  pg_fdw_remove(transaction xid DEFAULT NULL, dbid oid DEFAULT NULL,
+				serverid oid DEFAULT NULL, userid oid DEFAULT NULL)
+RETURNS void
+LANGUAGE INTERNAL
+VOLATILE
+AS 'pg_fdw_remove';
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index cc912b2..3408252 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -6,20 +6,21 @@
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  *
  *
  * IDENTIFICATION
  *	  src/backend/commands/foreigncmds.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
@@ -1080,20 +1081,34 @@ RemoveForeignServerById(Oid srvId)
 	HeapTuple	tp;
 	Relation	rel;
 
 	rel = heap_open(ForeignServerRelationId, RowExclusiveLock);
 
 	tp = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(srvId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
 
 	heap_close(rel, RowExclusiveLock);
 }
 
 
 /*
  * Common routine to check permission for user-mapping-related DDL
@@ -1252,20 +1267,21 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
 
 	umId = GetSysCacheOid2(USERMAPPINGUSERSERVER,
 						   ObjectIdGetDatum(useId),
 						   ObjectIdGetDatum(srv->serverid));
 	if (!OidIsValid(umId))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("user mapping \"%s\" does not exist for the server",
 						MappingUserName(useId))));
 
+
 	user_mapping_ddl_aclcheck(useId, srv->serverid, stmt->servername);
 
 	tp = SearchSysCacheCopy1(USERMAPPINGOID, ObjectIdGetDatum(umId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for user mapping %u", umId);
 
 	memset(repl_val, 0, sizeof(repl_val));
 	memset(repl_null, false, sizeof(repl_null));
 	memset(repl_repl, false, sizeof(repl_repl));
@@ -1378,20 +1394,31 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 		/* IF EXISTS specified, just note it */
 		ereport(NOTICE,
 		(errmsg("user mapping \"%s\" does not exist for the server, skipping",
 				MappingUserName(useId))));
 		return InvalidOid;
 	}
 
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
 	object.objectId = umId;
 	object.objectSubId = 0;
 
 	performDeletion(&object, DROP_CASCADE, 0);
 
 	return umId;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 1bb3138..01594c1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -86,20 +86,21 @@
 #ifdef USE_BONJOUR
 #include <dns_sd.h>
 #endif
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
 #include <pthread.h>
 #endif
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "lib/ilist.h"
 #include "libpq/auth.h"
 #include "libpq/ip.h"
 #include "libpq/libpq.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
@@ -2447,21 +2448,20 @@ pmdie(SIGNAL_ARGS)
 				SignalUnconnectedWorkers(SIGTERM);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
 				/* and the bgwriter too */
 				if (BgWriterPID != 0)
 					signal_child(BgWriterPID, SIGTERM);
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
-
 				/*
 				 * If we're in recovery, we can't kill the startup process
 				 * right away, because at present doing so does not release
 				 * its locks.  We might want to change this in a future
 				 * release.  For the time being, the PM_WAIT_READONLY state
 				 * indicates that we're waiting for the regular (read only)
 				 * backends to die off; once they do, we'll kill the startup
 				 * and walreceiver processes.
 				 */
 				pmState = (pmState == PM_RUN) ?
@@ -5689,20 +5689,21 @@ PostmasterMarkPIDForWorkerNotify(int pid)
 
 	dlist_foreach(iter, &BackendList)
 	{
 		bp = dlist_container(Backend, elem, iter.cur);
 		if (bp->pid == pid)
 		{
 			bp->bgworker_notify = true;
 			return true;
 		}
 	}
+
 	return false;
 }
 
 #ifdef EXEC_BACKEND
 
 /*
  * The following need to be available to the save/restore_backend_variables
  * functions.  They are marked NON_EXEC_STATIC in their home modules.
  */
 extern slock_t *ShmemLock;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c629da3..6fdd818 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -127,20 +127,21 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_MULTIXACT_ID:
 		case RM_RELMAP_ID:
 		case RM_BTREE_ID:
 		case RM_HASH_ID:
 		case RM_GIN_ID:
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
 	}
 }
 
 /*
  * Handle rmgr XLOG_ID records for DecodeRecordIntoReorderBuffer().
  */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a790e5b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -14,20 +14,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/fdw_xact.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
@@ -132,20 +133,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcSignalShmemSize());
 		size = add_size(size, CheckpointerShmemSize());
 		size = add_size(size, AutoVacuumShmemSize());
 		size = add_size(size, ReplicationSlotsShmemSize());
 		size = add_size(size, ReplicationOriginShmemSize());
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
 
 		/* freeze the addin request size and include it */
 		addin_request_allowed = false;
 		size = add_size(size, total_addin_request);
 
 		/* might as well round it off to a multiple of a typical page size */
 		size = add_size(size, 8192 - (size % 8192));
@@ -243,20 +245,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
 	/*
 	 * Alloc the win32 shared backend array
 	 */
 	if (!IsUnderPostmaster)
 		ShmemBackendArrayAllocation();
 #endif
 
diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c
index 6b61765..d6cba87 100644
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@@ -15,21 +15,20 @@
 #include "postgres.h"
 
 #include <limits.h>
 
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/xact.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 
-#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 #define PG_RETURN_TRANSACTIONID(x)	return TransactionIdGetDatum(x)
 
 #define PG_GETARG_COMMANDID(n)		DatumGetCommandId(PG_GETARG_DATUM(n))
 #define PG_RETURN_COMMANDID(x)		return CommandIdGetDatum(x)
 
 
 Datum
 xidin(PG_FUNCTION_ARGS)
 {
 	char	   *str = PG_GETARG_CSTRING(0);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 1b7b914..e53751e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -20,20 +20,21 @@
 #include <float.h>
 #include <math.h>
 #include <limits.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/vacuum.h"
 #include "commands/variable.h"
 #include "commands/trigger.h"
@@ -1999,20 +2000,33 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"max_prepared_transactions", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the maximum number of simultaneously prepared transactions."),
 			NULL
 		},
 		&max_prepared_xacts,
 		0, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("Sets the minimum OID of tables for tracking locks."),
 			gettext_noop("Is used to avoid output on system tables."),
 			GUC_NOT_IN_SAMPLE
 		},
 		&Trace_lock_oidmin,
 		FirstNormalObjectId, 0, INT_MAX,
 		NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..2107f95 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -116,20 +116,26 @@
 					# (change requires restart)
 #huge_pages = try			# on, off, or try
 					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
 # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
 # per transaction slot, plus lock space (see max_locks_per_transaction).
 # It is not advisable to set max_prepared_transactions nonzero unless you
 # actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0		# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
 					# supported by the operating system:
 					#   posix
 					#   sysv
 					#   windows
 					#   mmap
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index feeff9e..47ecf1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,31 +192,32 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
 	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
 };
 
 
 /* path to 'initdb' binary directory */
 static char bin_path[MAXPGPATH];
 static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			  const char *token, const char *replacement);
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d8cfe5e..00aad71 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -324,12 +324,14 @@ main(int argc, char *argv[])
 	printf(_("Size of a large-object chunk:         %u\n"),
 		   ControlFile.loblksize);
 	printf(_("Date/time type storage:               %s\n"),
 		   (ControlFile.enableIntTimes ? _("64-bit integers") : _("floating-point numbers")));
 	printf(_("Float4 argument passing:              %s\n"),
 		   (ControlFile.float4ByVal ? _("by value") : _("by reference")));
 	printf(_("Float8 argument passing:              %s\n"),
 		   (ControlFile.float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile.data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile.max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index e7e8059..880e895 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -581,20 +581,21 @@ GuessControlValues(void)
 	ControlFile.unloggedLSN = 1;
 
 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
 	ControlFile.relseg_size = RELSEG_SIZE;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
 	ControlFile.indexMaxKeys = INDEX_MAX_KEYS;
@@ -797,20 +798,21 @@ RewriteControlFile(void)
 	 * Force the defaults for max_* settings. The values don't really matter
 	 * as long as wal_level='minimal'; the postmaster will reset these fields
 	 * anyway at startup.
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
 	ControlFile.xlog_seg_size = XLogSegSize;
 
 	/* Contents are protected with a CRC */
 	INIT_CRC32C(ControlFile.crc);
 	COMP_CRC32C(ControlFile.crc,
 				(char *) &ControlFile,
 				offsetof(ControlFileData, crc));
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 2205d6e..b9f3d84 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -14,20 +14,21 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/fdw_xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
 	{ name, desc, identify},
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..2f43ecd
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,73 @@
+/*
+ * fdw_xact.h 
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H 
+#define FDW_XACT_H 
+
+#include "storage/backendid.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "foreign/fdwapi.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serveroid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serveroid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serveroid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c083216..7272c33 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -37,11 +37,12 @@ PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify,
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
 PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index cb1c2db..d614ab6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -296,20 +296,21 @@ typedef struct xl_xact_parsed_abort
 	RelFileNode *xnodes;
 
 	TransactionId twophase_xid; /* only for 2PC */
 } xl_xact_parsed_abort;
 
 
 /* ----------------
  *		extern definitions
  * ----------------
  */
+#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 extern bool IsTransactionState(void);
 extern bool IsAbortedTransactionBlockState(void);
 extern TransactionId GetTopTransactionId(void);
 extern TransactionId GetTopTransactionIdIfAny(void);
 extern TransactionId GetCurrentTransactionId(void);
 extern TransactionId GetCurrentTransactionIdIfAny(void);
 extern TransactionId GetStableLatestTransactionId(void);
 extern SubTransactionId GetCurrentSubTransactionId(void);
 extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 5ebaa5f..c4d80e6 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -206,20 +206,21 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 
 /*
  * Information logged when we detect a change in one of the parameters
  * important for Hot Standby.
  */
 typedef struct xl_parameter_change
 {
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
 	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
 typedef struct xl_restore_point
 {
 	TimestampTz rp_time;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..d168c32 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -173,20 +173,21 @@ typedef struct ControlFileData
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
 	 * or hot standby.
 	 */
 	int			wal_level;
 	bool		wal_log_hints;
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
 	 * explicitly, since the pg_control version will surely look wrong to a
 	 * machine of different endianness, but we do need to worry about MAXALIGN
 	 * and floating-point format.  (Note: storage layout nominally also
 	 * depends on SHORTALIGN and INTALIGN, but in practice these are the same
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 2563bb9..829dcac 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5266,20 +5266,26 @@ DESCR("fractional rank of hypothetical row");
 DATA(insert OID = 3989 ( percent_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_percent_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3990 ( cume_dist			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 701 "2276" "{2276}" "{v}" _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("cumulative distribution of hypothetical row");
 DATA(insert OID = 3991 ( cume_dist_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_cume_dist_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 20 "2276" "{2276}" "{v}" _null_ _null_ _null_	aggregate_dummy _null_ _null_ _null_ ));
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_	hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4066 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4083 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4099 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3584 ( binary_upgrade_set_next_array_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_array_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3585 ( binary_upgrade_set_next_toast_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_toast_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3586 ( binary_upgrade_set_next_heap_pg_class_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_heap_pg_class_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..d1ddb4e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -5,20 +5,21 @@
  *
  * Copyright (c) 2010-2015, PostgreSQL Global Development Group
  *
  * src/include/foreign/fdwapi.h
  *
  *-------------------------------------------------------------------------
  */
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/xact.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
 /* To avoid including explain.h here, reference ExplainState thus: */
 struct ExplainState;
 
 
 /*
  * Callback function signatures --- see fdwhandler.sgml for more info.
  */
@@ -110,20 +111,32 @@ typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 											   HeapTuple *rows, int targrows,
 												  double *totalrows,
 												  double *totaldeadrows);
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 												 AcquireSampleRowsFunc *func,
 													BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
+typedef bool (*EndForeignTransaction_function) (Oid serverOid, Oid userid,
+													bool is_commit);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverOid, Oid userid,
+														int prep_info_len,
+														char *prep_info);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverOid, Oid userid,
+															bool is_commit,
+														int prep_info_len,
+														char *prep_info);
+typedef char *(*GetPrepareId_function) (Oid serverOid, Oid userid,
+														int *prep_info_len);
+
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
  * planner and executor.
  *
  * More function pointers are likely to be added in the future.  Therefore
  * it's recommended that the handler initialize the struct with
  * makeNode(FdwRoutine) so that all fields are set to NULL.  This will
  * ensure that no fields are accidentally left undefined.
@@ -165,20 +178,26 @@ typedef struct FdwRoutine
 
 	/* Support functions for EXPLAIN */
 	ExplainForeignScan_function ExplainForeignScan;
 	ExplainForeignModify_function ExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	AnalyzeForeignTable_function AnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
+
+	/* Support functions for foreign transactions */
+	GetPrepareId_function				GetPrepareId;
+	EndForeignTransaction_function		EndForeignTransaction;
+	PrepareForeignTransaction_function	PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
 } FdwRoutine;
 
 
 /* Functions in foreign/foreign.c */
 extern FdwRoutine *GetFdwRoutine(Oid fdwhandler);
 extern Oid	GetForeignServerIdByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineByServerId(Oid serverid);
 extern FdwRoutine *GetFdwRoutineByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineForRelation(Relation relation, bool makecopy);
 extern bool IsImportableForeignTable(const char *tablename,
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cff3b99..d03b119 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -128,22 +128,23 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define OldSerXidLock				(&MainLWLockArray[31].lock)
 #define SyncRepLock					(&MainLWLockArray[32].lock)
 #define BackgroundWorkerLock		(&MainLWLockArray[33].lock)
 #define DynamicSharedMemoryControlLock		(&MainLWLockArray[34].lock)
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
 #define ReplicationOriginLock		(&MainLWLockArray[40].lock)
+#define FDWXactLock					(&MainLWLockArray[41].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		41
+#define NUM_INDIVIDUAL_LWLOCKS		42
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
  * here, but we need them to figure out offsets within MainLWLockArray, and
  * having this file include lock.h or bufmgr.h would be backwards.
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 202a672..3814feb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -210,25 +210,26 @@ typedef struct PROC_HDR
 } PROC_HDR;
 
 extern PROC_HDR *ProcGlobal;
 
 extern PGPROC *PreparedXactProcs;
 
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
 extern int	DeadlockTimeout;
 extern int	StatementTimeout;
 extern int	LockTimeout;
 extern bool log_lock_waits;
 
 
 /*
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..d31ceb0 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1257,11 +1257,15 @@ extern Datum pg_available_extensions(PG_FUNCTION_ARGS);
 extern Datum pg_available_extension_versions(PG_FUNCTION_ARGS);
 extern Datum pg_extension_update_paths(PG_FUNCTION_ARGS);
 extern Datum pg_extension_config_dump(PG_FUNCTION_ARGS);
 
 /* commands/prepare.c */
 extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 6206c81..8bd6c55 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1305,20 +1305,30 @@ pg_available_extensions| SELECT e.name,
     e.comment
    FROM (pg_available_extensions() e(name, default_version, comment)
      LEFT JOIN pg_extension x ON ((e.name = x.extname)));
 pg_cursors| SELECT c.name,
     c.statement,
     c.is_holdable,
     c.is_binary,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT p.transaction,
+    d.datname AS database,
+    s.srvname AS "foreign server",
+    u.rolname AS "local user",
+    p.status,
+    p.identifier AS "foreign transaction identifier"
+   FROM (((pg_fdw_xact() p(dbid, transaction, serverid, userid, status, identifier)
+     LEFT JOIN pg_authid u ON ((p.userid = u.oid)))
+     LEFT JOIN pg_database d ON ((p.dbid = d.oid)))
+     LEFT JOIN pg_foreign_server s ON ((p.serverid = s.oid)));
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
     a.name,
     a.setting,
     a.applied,
     a.error
    FROM pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting, applied, error);
 pg_group| SELECT pg_authid.rolname AS groname,
     pg_authid.oid AS grosysid,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index dd65ab5..3c23446 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2224,37 +2224,40 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		if (system(buf))
 		{
 			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
 			exit(2);
 		}
 
 		/*
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions.  (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
 		if (pg_conf == NULL)
 		{
 			fprintf(stderr, _("\n%s: could not open \"%s\" for adding extra config: %s\n"), progname, buf, strerror(errno));
 			exit(2);
 		}
 		fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
 		fputs("log_autovacuum_min_duration = 0\n", pg_conf);
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		if (temp_config != NULL)
 		{
 			FILE	   *extra_conf;
 			char		line_buf[1024];
 
 			extra_conf = fopen(temp_config, "r");
 			if (extra_conf == NULL)
 			{
 				fprintf(stderr, _("\n%s: could not open \"%s\" to read extra config: %s\n"), progname, temp_config, strerror(errno));

#42

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 10 years ago

In reply to: Ashutosh Bapat (#41)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

The previous patch would not compile on the latest HEAD. Here's an updated
patch.

On Tue, Aug 11, 2015 at 1:55 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Aug 5, 2015 at 6:20 AM, Amit Langote <
Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-08-05 AM 06:11, Robert Haas wrote:

On Mon, Aug 3, 2015 at 8:19 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2015-08-03 PM 09:24, Ashutosh Bapat wrote:

For postgres_fdw it's a boolean server-level option 'twophase_compliant'
(suggestions for name welcome).

How about just 'twophase'?

How about two_phase_commit?

Much cleaner, +1

I was more inclined to use an adjective rather than a noun, since it's a
property of the server. But two_phase_commit looks fine as well, so it is
included in the attached patch.

Attached patch addresses all the concerns and suggestions from previous
mails in this mail thread.
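
For anyone skimming the thread, here is a minimal, self-contained sketch of the decision the two_phase_commit option enables. It is not code from the patch: the Participant struct, the two_phase_commit() function and the simulated prepare_ok outcome are hypothetical names made up for illustration, and the "commands" are only recorded in a buffer rather than sent over libpq. It assumes the usual protocol: prepare on every server, commit the prepared transactions only if every prepare succeeded, otherwise roll everything back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical participant: one foreign server taking part in a transaction. */
typedef struct Participant
{
	const char *gid;			/* prepared transaction identifier */
	bool		prepare_ok;		/* simulated outcome of PREPARE TRANSACTION */
	char		last_cmd[128];	/* last command "sent" to this server */
} Participant;

/*
 * Run a toy two-phase commit over n participants.
 * Returns true if the transaction was committed everywhere.
 */
static bool
two_phase_commit(Participant *parts, int n)
{
	int			nprepared = 0;
	bool		all_ok = true;
	int			i;

	assert(parts != NULL && n >= 0);

	/* Phase 1: prepare on each server, stopping at the first failure. */
	for (i = 0; i < n; i++)
	{
		snprintf(parts[i].last_cmd, sizeof(parts[i].last_cmd),
				 "PREPARE TRANSACTION '%s'", parts[i].gid);
		if (!parts[i].prepare_ok)
		{
			all_ok = false;
			break;
		}
		nprepared++;
	}

	/* Phase 2: resolve every transaction that was successfully prepared. */
	for (i = 0; i < nprepared; i++)
		snprintf(parts[i].last_cmd, sizeof(parts[i].last_cmd),
				 "%s PREPARED '%s'", all_ok ? "COMMIT" : "ROLLBACK",
				 parts[i].gid);

	/* Servers that failed or were never prepared get a plain rollback. */
	for (i = nprepared; i < n; i++)
		snprintf(parts[i].last_cmd, sizeof(parts[i].last_cmd),
				 "ROLLBACK TRANSACTION");

	return all_ok;
}
```

Calling two_phase_commit() on an array where one entry has prepare_ok set to false leaves "ROLLBACK PREPARED '...'" recorded for the servers prepared before it and "ROLLBACK TRANSACTION" for the rest. The sketch deliberately ignores the hard part discussed in this thread, namely a connection being lost between the two phases.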

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact.patch (binary/octet-stream)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..6f587ae
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,364 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How often the resolver daemon checks for unresolved transactions (in ms) */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/* 
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL; 
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+			
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid); 
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/* 
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop forever, so we need a handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+	
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..341db6f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -8,20 +8,22 @@
  * IDENTIFICATION
  *		  contrib/postgres_fdw/connection.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 
 
 /*
  * Connection cache hash table entry
  *
  * The lookup key in this hash table is the foreign server OID plus the user
@@ -57,52 +59,59 @@ typedef struct ConnCacheEntry
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+									bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
  * server with the user's authorization.  A new connection is established
  * if we don't already have a suitable one, and a transaction is opened at
  * the right subtransaction nesting depth if we didn't do that already.
  *
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok, if true, makes a connection failure return NULL so the
+ * caller can handle it; if false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
  * syscaches.  For the moment, though, it's not clear that this would really
  * be useful and not mere pedantry.  We could not flush any active connections
  * mid-transaction anyway.
  */
 PGconn *
 GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt)
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
 		HASHCTL		ctl;
 
@@ -116,23 +125,20 @@ GetConnection(ForeignServer *server, UserMapping *user,
 									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
 		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key.serverid = server->serverid;
 	key.userid = user->userid;
 
 	/*
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
@@ -152,41 +158,64 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 	server->servername);
+			return NULL;
+		}
+
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, server);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
+
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
 	return entry->conn;
 }
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+					bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
 	/*
 	 * Use PG_TRY block to ensure closing connection on error.
 	 */
 	PG_TRY();
 	{
 		const char **keywords;
 		const char **values;
@@ -227,25 +256,29 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 		{
 			char	   *connmessage;
 			int			msglen;
 
 			/* libpq typically appends a newline, strip that */
 			connmessage = pstrdup(PQerrorMessage(conn));
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"",
+					   		server->servername),
+						errdetail_internal("%s", connmessage)));
 		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
 		 * otherwise, he's piggybacking on the postgres server's user
 		 * identity. See also dblink_security_check() in contrib/dblink.
 		 */
 		if (!superuser() && !PQconnectionUsedPassword(conn))
 			ereport(ERROR,
 				  (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
@@ -362,29 +395,36 @@ do_sql_command(PGconn *conn, const char *sql)
  * Start remote transaction or subtransaction, if needed.
  *
  * Note that we always use at least REPEATABLE READ in the remote session.
  * This is so that, if a query initiates multiple scans of the same or
  * different foreign tables, we will get snapshot-consistent results from
  * those scans.  A disadvantage is that we can't provide sane emulation of
  * READ COMMITTED behavior --- it would be nice if we had some other way to
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and record whether it can take
+		 * part in two-phase commit.
+		 */
+		RegisterXactForeignServer(entry->key.serverid, entry->key.userid,
+									server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
 		if (IsolationIsSerializable())
 			sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
 		else
 			sql = "START TRANSACTION ISOLATION LEVEL REPEATABLE READ";
 		do_sql_command(entry->conn, sql);
 		entry->xact_depth = 1;
 	}
@@ -506,148 +546,295 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * Crafts a prepared transaction identifier. The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It must be a string literal less than 200 bytes long.
+ * 2. It must not match the identifier of any other currently prepared
+ *    transaction.
+ *
+ * Ideally we would use something like a UUID, which is unique with high
+ * probability, but that may be expensive here, and the extension that
+ * provides the UUID-generating function is not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4ld_%u_%u",
+								"px", labs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The reported length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Connection hash should have a connection we want */
+		
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+
+		PGconn	*conn = entry->conn;
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Connection hash should have a connection we want */
+		
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
+		
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key.serverid = serverid;
+		key.userid = userid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+	if (!conn && IsTransactionState())
+	{
+		ForeignServer	*foreign_server = GetForeignServer(serverid); 
+		UserMapping		*user_mapping = GetUserMapping(userid, serverid);
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+		conn = GetConnection(foreign_server, user_mapping, false, false, true);
+	}
 
-			switch (event)
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+	
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+	
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Either way, consider it resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.  We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
 	/* Also reset cursor numbering for next transaction */
 	cursor_number = 0;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
@@ -708,10 +895,33 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 			if (PQresultStatus(res) != PGRES_COMMAND_OK)
 				pgfdw_report_error(WARNING, res, entry->conn, true, sql);
 			else
 				PQclear(res);
 		}
 
 		/* OK, we're outta that level of subtransaction */
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
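
Reviewer note on server_uses_two_phase_commit() above: the function scans the server's option list and falls back to false when two_phase_commit is absent, so 2PC is strictly opt-in per foreign server. A minimal standalone sketch of that lookup, assuming a hypothetical ServerOption struct in place of DefElem and a simplified parse_bool() in place of defGetBoolean() (the real function accepts more spellings and raises an error on invalid input):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one server option (name/value pair). */
typedef struct
{
	const char *name;
	const char *value;
} ServerOption;

/* Simplified boolean parse; an assumption, not PostgreSQL's defGetBoolean(). */
static bool
parse_bool(const char *value)
{
	return strcmp(value, "true") == 0 || strcmp(value, "on") == 0;
}

/* Returns the two_phase_commit setting, or false when the option is absent. */
static bool
uses_two_phase_commit(const ServerOption *opts, size_t nopts)
{
	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase_commit") == 0)
			return parse_bool(opts[i].value);
	}

	/* Option not set: the server is treated as not 2PC capable. */
	return false;
}
```

With this default, a server created without the option keeps the pre-patch one-phase behaviour, which is what the regression test relies on when it flips loopback2 to 'false'.
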
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1f417b3..118c42b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3634,10 +3634,348 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors to occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_with_fdw           | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_with_fdw           | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server in
+-- turn rolling back the whole transaction.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     2
+(1 row)
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+ pg_fdw_remove 
+---------------
+ 
+(1 row)
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     0
+(1 row)
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+ database | foreign server | local user | status 
+----------+----------------+------------+--------
+(0 rows)
+
+SELECT database, gid FROM pg_prepared_xacts;
+ database | gid 
+----------+-----
+(0 rows)
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_fdw_subxact        | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_fdw_subxact        | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- Those servers capable of two phase commit, will commit their transactions
+-- atomically with the local transaction. The transactions on the incapable
+-- servers will be committed independent of the outcome of the other foreign
+-- transactions.
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+ALTER SERVER loopback2 OPTIONS (SET two_phase_commit 'false'); 
+-- Changes to the local server and the loopback1 will be rolled back as prepare
+-- on loopback1 would fail because of constraint violation. But the changes on
+-- loopback2, which doesn't execute two phase commit, will be committed.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (2);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (1);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   2
+(2 rows)
+
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+-- Changes to all the servers, local and foreign, will be rolled back as those
+-- on loopback2 (incapable of two-phase commit) could not be commited.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (1);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+(3 rows)
+
+-- At the end, we should not have any foreign transaction remaining unresolved
+SELECT * FROM pg_fdw_xacts;
+ transaction | database | foreign server | local user | status | foreign transaction identifier 
+-------------+----------+----------------+------------+--------+--------------------------------
+(0 rows)
+
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
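
Reviewer note on the expected output above: when both servers have two_phase_commit enabled, a failed prepare on either one rolls back the whole transaction, so lt is unchanged afterwards. The following is only a toy simulation of that all-or-nothing rule, not the patch's code; Participant, XactState and prepare_will_fail are hypothetical names invented for the illustration, and servers without 2PC are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	XACT_ACTIVE,
	XACT_PREPARED,
	XACT_COMMITTED,
	XACT_ABORTED
} XactState;

/* Hypothetical participant: one 2PC-capable foreign server. */
typedef struct
{
	const char *name;
	bool		prepare_will_fail;	/* simulates a deferred constraint violation */
	XactState	state;
} Participant;

/* Phase 1 for a single participant. */
static bool
prepare_participant(Participant *p)
{
	if (p->prepare_will_fail)
	{
		p->state = XACT_ABORTED;
		return false;
	}
	p->state = XACT_PREPARED;
	return true;
}

/* Returns true if every participant committed, false if all rolled back. */
static bool
commit_all_or_nothing(Participant *parts, size_t n)
{
	/* Phase 1: prepare everywhere, stopping at the first failure. */
	for (size_t i = 0; i < n; i++)
	{
		if (!prepare_participant(&parts[i]))
		{
			/* Undo everything: prepared and still-active participants alike. */
			for (size_t j = 0; j < n; j++)
				parts[j].state = XACT_ABORTED;
			return false;
		}
	}

	/* Phase 2: all prepares succeeded, so commit each prepared transaction. */
	for (size_t i = 0; i < n; i++)
		parts[i].state = XACT_COMMITTED;
	return true;
}
```

The sketch ignores the case the thread is most concerned with, a connection lost between the two phases, where a participant can stay in the prepared state until someone resolves it.
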
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 7547ec2..b70bbd3 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -98,21 +98,22 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname),
 					 errhint("Valid options in this context are: %s",
 							 buf.data)));
 		}
 
 		/*
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
 		}
 		else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 				 strcmp(def->defname, "fdw_tuple_cost") == 0)
 		{
 			/* these must have a non-negative numeric value */
 			double		val;
 			char	   *endp;
@@ -146,20 +147,22 @@ InitPgFdwOptions(void)
 		{"column_name", AttributeRelationId, false},
 		/* use_remote_estimate is available on both server and table */
 		{"use_remote_estimate", ForeignServerRelationId, false},
 		{"use_remote_estimate", ForeignTableRelationId, false},
 		/* cost factors */
 		{"fdw_startup_cost", ForeignServerRelationId, false},
 		{"fdw_tuple_cost", ForeignServerRelationId, false},
 		/* updatable is available on both server and table */
 		{"updatable", ForeignServerRelationId, false},
 		{"updatable", ForeignTableRelationId, false},
+		/* 2PC compatibility */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
 	/* Prevent redundant initialization. */
 	if (postgres_fdw_options)
 		return;
 
 	/*
 	 * Get list of valid libpq options.
 	 *
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e4d799c..f574543 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -9,20 +9,22 @@
  *		  contrib/postgres_fdw/postgres_fdw.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -362,20 +364,26 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for EXPLAIN */
 	routine->ExplainForeignScan = postgresExplainForeignScan;
 	routine->ExplainForeignModify = postgresExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	routine->AnalyzeForeignTable = postgresAnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	PG_RETURN_POINTER(routine);
 }
 
 /*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
  * We should consider the effect of all baserestrictinfo clauses here, but
  * not any join clauses.
  */
@@ -918,21 +926,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	/* Get info about foreign table. */
 	fsstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(fsstate->rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(server, user, false);
+	fsstate->conn = GetConnection(server, user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
 
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
 	fsstate->retrieved_attrs = (List *) list_nth(fsplan->fdw_private,
 											   FdwScanPrivateRetrievedAttrs);
@@ -1316,21 +1324,21 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	 */
 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
 
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(server, user, true);
+	fmstate->conn = GetConnection(server, user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
 	fmstate->query = strVal(list_nth(fdw_private,
 									 FdwModifyPrivateUpdateSql));
 	fmstate->target_attrs = (List *) list_nth(fdw_private,
 											  FdwModifyPrivateTargetAttnums);
 	fmstate->has_returning = intVal(list_nth(fdw_private,
 											 FdwModifyPrivateHasReturning));
 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
@@ -1766,21 +1774,21 @@ estimate_path_cost_size(PlannerInfo *root,
 		deparseSelectSql(&sql, root, baserel, fpinfo->attrs_used,
 						 &retrieved_attrs);
 		if (fpinfo->remote_conds)
 			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
 							  true, NULL);
 		if (remote_join_conds)
 			appendWhereClause(&sql, root, baserel, remote_join_conds,
 							  (fpinfo->remote_conds == NIL), NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		conn = GetConnection(fpinfo->server, fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
 
 		retrieved_rows = rows;
 
 		/* Factor in the selectivity of the locally-checked quals */
 		local_sel = clauselist_selectivity(root,
 										   local_join_conds,
 										   baserel->relid,
@@ -2330,21 +2338,21 @@ postgresAnalyzeForeignTable(Relation relation,
 	 * it's probably not worth redefining that API at this point.
 	 */
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
 	 */
 	initStringInfo(&sql);
 	deparseAnalyzeSizeSql(&sql, relation);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
 	{
@@ -2422,21 +2430,21 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 											ALLOCSET_SMALL_INITSIZE,
 											ALLOCSET_SMALL_MAXSIZE);
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
 	 */
 	cursor_number = GetCursorNumber(conn);
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
 
 	/* In what follows, do not risk leaking any PGresults. */
@@ -2623,21 +2631,21 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname)));
 	}
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(server, mapping, false);
+	conn = GetConnection(server, mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
 		import_collate = false;
 
 	/* Create workspace for strings */
 	initStringInfo(&buf);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
@@ -2987,10 +2995,11 @@ static void
 conversion_error_callback(void *arg)
 {
 	ConversionLocation *errpos = (ConversionLocation *) arg;
 	TupleDesc	tupdesc = RelationGetDescr(errpos->rel);
 
 	if (errpos->cur_attno > 0 && errpos->cur_attno <= tupdesc->natts)
 		errcontext("column \"%s\" of foreign table \"%s\"",
 				   NameStr(tupdesc->attrs[errpos->cur_attno - 1]->attname),
 				   RelationGetRelationName(errpos->rel));
 }
+
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 3835ddb..8d24359 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -10,30 +10,32 @@
  *
  *-------------------------------------------------------------------------
  */
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
 #include "utils/relcache.h"
+#include "access/fdw_xact.h"
 
 #include "libpq-fe.h"
 
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt);
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
 						 const char **keywords,
 						 const char **values);
@@ -67,12 +69,19 @@ extern void deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 				 List *targetAttrs, List *returningList,
 				 List **retrieved_attrs);
 extern void deparseDeleteSql(StringInfo buf, PlannerInfo *root,
 				 Index rtindex, Relation rel,
 				 List *returningList,
 				 List **retrieved_attrs);
 extern void deparseAnalyzeSizeSql(StringInfo buf, Relation rel);
 extern void deparseAnalyzeSql(StringInfo buf, Relation rel,
 				  List **retrieved_attrs);
 extern void deparseStringLiteral(StringInfo buf, const char *val);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info);
 
 #endif   /* POSTGRES_FDW_H */
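
Reviewer note on postgresResolvePreparedForeignTransaction() declared above: in the connection.c part of this patch, a failed COMMIT PREPARED or ROLLBACK PREPARED still counts as resolved when the remote SQLSTATE is ERRCODE_UNDEFINED_OBJECT, and a missing SQLSTATE is mapped to ERRCODE_CONNECTION_FAILURE. A standalone sketch of that decision follows. The macros reproduce the packing scheme from PostgreSQL's utils/elog.h so the block compiles on its own; prepared_xact_resolved() and its command_ok parameter are hypothetical names, and the length check is an extra guard added here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Same packing as PostgreSQL's utils/elog.h, copied to keep this standalone. */
#define PGSIXBIT(ch)	(((ch) - '0') & 0x3F)
#define MAKE_SQLSTATE(c1, c2, c3, c4, c5) \
	(PGSIXBIT(c1) + (PGSIXBIT(c2) << 6) + (PGSIXBIT(c3) << 12) + \
	 (PGSIXBIT(c4) << 18) + (PGSIXBIT(c5) << 24))

#define ERRCODE_CONNECTION_FAILURE	MAKE_SQLSTATE('0', '8', '0', '0', '6')
#define ERRCODE_UNDEFINED_OBJECT	MAKE_SQLSTATE('4', '2', '7', '0', '4')

/*
 * command_ok: whether the remote COMMIT/ROLLBACK PREPARED succeeded.
 * diag_sqlstate: five-character SQLSTATE from the remote error, or NULL.
 */
static bool
prepared_xact_resolved(bool command_ok, const char *diag_sqlstate)
{
	int			sqlstate;

	if (command_ok)
		return true;

	if (diag_sqlstate != NULL && strlen(diag_sqlstate) == 5)
		sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
								 diag_sqlstate[1],
								 diag_sqlstate[2],
								 diag_sqlstate[3],
								 diag_sqlstate[4]);
	else
		sqlstate = ERRCODE_CONNECTION_FAILURE;

	/* A missing prepared transaction means someone already resolved it. */
	return sqlstate == ERRCODE_UNDEFINED_OBJECT;
}
```

Any other failure, including a lost connection, returns false, so the caller can keep the foreign transaction entry around and retry later.
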
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index fcdd92e..e137420 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -827,10 +827,239 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors to occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server in
+-- turn rolling back the whole transaction.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+SELECT database, gid FROM pg_prepared_xacts;
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- Those servers capable of two phase commit, will commit their transactions
+-- atomically with the local transaction. The transactions on the incapable
+-- servers will be committed independent of the outcome of the other foreign
+-- transactions.
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+
+ALTER SERVER loopback2 OPTIONS (SET two_phase_commit 'false'); 
+-- Changes to the local server and the loopback1 will be rolled back as prepare
+-- on loopback1 would fail because of constraint violation. But the changes on
+-- loopback2, which doesn't execute two phase commit, will be committed.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (2);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (1);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+
+-- Changes to all the servers, local and foreign, will be rolled back as those
+-- on loopback2 (incapable of two-phase commit) could not be commited.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (1);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- At the end, we should not have any foreign transaction remaining unresolved
+SELECT * FROM pg_fdw_xacts;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dcc..f918f87 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1411,20 +1411,62 @@ include_dir 'conf.d'
        </para>
 
        <para>
         When running a standby server, you must set this parameter to the
         same or higher value than on the master server. Otherwise, queries
         will not be allowed in the standby server.
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions cannot be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled, a transaction involving one or more foreign
+       servers is guaranteed to commit either all or none of its changes on them.
+       The parameter can be set at any time during the session. The value in effect
+       when the transaction commits is the one that is used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>work_mem</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
         Specifies the amount of memory to be used by internal sort operations
         and hash tables before writing to temporary disk files. The value
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 2361577..9b931e4 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -918,20 +918,85 @@ ImportForeignSchema (ImportForeignSchemaStmt *stmt, Oid serverOid);
      useful to test whether a given foreign-table name will pass the filter.
     </para>
 
     <para>
      If the FDW does not support importing table definitions, the
      <function>ImportForeignSchema</> pointer can be set to <literal>NULL</>.
     </para>
 
    </sect2>
 
+   <sect2 id="fdw-callbacks-transactions">
+    <title>FDW Routines For Transaction Management</title>
+
+    <para>
+<programlisting>
+char *
+GetPrepareInfo (Oid serverOid, Oid userid, int *prep_info_len);
+</programlisting>
+
+     Get the prepared transaction identifier for the given foreign server and
+     user. This function is called when executing <xref linkend="sql-commit">, if
+     <literal>atomic_foreign_transaction</> is enabled. It should return a
+     valid transaction identifier that will be used to prepare the transaction
+     on the foreign server. <parameter>prep_info_len</> should be set
+     to the length of this identifier. The identifier should not be longer
+     than 256 bytes and should not conflict with existing
+     identifiers on the foreign server. It should also be unique enough that it
+     will not identify some future transaction. A transaction may be
+     considered unresolved on <productname>PostgreSQL</> while it has in fact
+     been resolved. The foreign transaction resolver then keeps trying to resolve
+     the transaction until it finds out that it has already been resolved.
+     If the identifier is the same as that of a future transaction,
+     that future transaction could be resolved
+     erroneously.
+    </para>
+
+    <para>
+     If a foreign server whose Foreign Data Wrapper has a <literal>NULL</>
+     <function>GetPrepareInfo</> pointer participates in a transaction
+     with <literal>atomic_foreign_transaction</> enabled, the transaction
+     is aborted.
+    </para>
+
+    <para>
+<programlisting>
+bool
+HandleForeignTransaction (Oid serverOid, Oid userid, FDWXactAction action,
+                            int prep_id_len, char *prep_id);
+</programlisting>
+
+     Function to end a transaction on the given foreign server with given user.
+     This function is called when executing <xref linkend="sql-commit"> or
+     <xref linkend="sql-rollback">. The function should complete a transaction
+     according to the <parameter>action</> specified. The function should
+     return TRUE on successful completion of transaction and FALSE otherwise.
+     It should not throw an error in case of failure to complete the transaction.
+    </para>
+
+    <para>
+    When <parameter>action</> is FDW_XACT_COMMIT or FDW_XACT_ABORT, the function
+    should commit or roll back the running transaction, respectively. When <parameter>action</>
+    is FDW_XACT_PREPARE, the running transaction should be prepared with the
+    identifier given by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    When <parameter>action</> is FDW_XACT_COMMIT_PREPARED or FDW_XACT_ABORT_PREPARED,
+    the function should respectively commit or roll back the transaction identified
+    by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    </para>
+
+    <para>
+    This function is usually called at the end of the transaction, when
+    access to the database may not be possible. Trying to access catalogs
+    in this function may cause an error to be thrown and can affect other foreign
+    data wrappers.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
     <title>Foreign Data Wrapper Helper Functions</title>
 
     <para>
      Several helper functions are exported from the core server so that
      authors of foreign data wrappers can get easy access to attributes of
      FDW-related objects, such as FDW options.
      To use any of these functions, you need to include the header file
@@ -1308,11 +1373,93 @@ GetForeignServerByName(const char *name, bool missing_ok);
     <para>
      See <filename>src/include/nodes/lockoptions.h</>, the comments
      for <type>RowMarkType</> and <type>PlanRowMark</>
      in <filename>src/include/nodes/plannodes.h</>, and the comments for
      <type>ExecRowMark</> in <filename>src/include/nodes/execnodes.h</> for
      additional information.
     </para>
 
   </sect1>
 
+   <sect1 id="fdw-transactions">
+    <title>Transaction Manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts a
+    transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it is required to register that foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to it,
+    using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant);
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. In this protocol the commit
+    is processed in two phases: a prepare phase and a commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any foreign server fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed to prepare
+    the transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareInfo</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_PREPARE.
+    </para>
+    
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier and action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity cannot be
+    guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
  </chapter>
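
Reviewer's sketch (not part of the patch): the prepare/commit flow described in the fdwhandler.sgml text above can be illustrated with a small standalone C model. Every name here (SketchServer, SketchAction, sketch_handle, sketch_two_phase_commit) is hypothetical and exists only for illustration. The model also leaves out the local commit that happens between the two phases, and the persistence needed to resolve transactions after a crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the actions named in the documentation. */
typedef enum
{
	SKETCH_PREPARE,
	SKETCH_COMMIT_PREPARED,
	SKETCH_ABORT_PREPARED
} SketchAction;

/* Hypothetical participant: models one registered foreign server. */
typedef struct
{
	bool		fail_prepare;	/* simulate a prepare failure on this server */
	bool		prepared;		/* true while a prepared transaction exists */
	bool		committed;		/* true once the prepared xact was committed */
} SketchServer;

/* Plays the role of an FDW transaction handler; returns false on failure. */
static bool
sketch_handle(SketchServer *srv, SketchAction action)
{
	switch (action)
	{
		case SKETCH_PREPARE:
			if (srv->fail_prepare)
				return false;
			srv->prepared = true;
			return true;
		case SKETCH_COMMIT_PREPARED:
			srv->committed = true;
			srv->prepared = false;
			return true;
		case SKETCH_ABORT_PREPARED:
			srv->prepared = false;
			return true;
	}
	return false;				/* keep compiler happy */
}

/* Returns true if the transaction was committed on every server. */
static bool
sketch_two_phase_commit(SketchServer *servers, size_t n)
{
	size_t		nprepared = 0;
	size_t		i;
	bool		ok = true;

	assert(servers != NULL || n == 0);

	/* Phase 1: prepare on each server, stopping at the first failure. */
	for (i = 0; i < n; i++)
	{
		if (!sketch_handle(&servers[i], SKETCH_PREPARE))
		{
			ok = false;
			break;
		}
		nprepared++;
	}

	/* Phase 2: resolve only the servers that were actually prepared. */
	for (i = 0; i < nprepared; i++)
		sketch_handle(&servers[i],
					  ok ? SKETCH_COMMIT_PREPARED : SKETCH_ABORT_PREPARED);

	return ok;
}
```

Calling sketch_two_phase_commit() on an array in which one element has fail_prepare set leaves every element with both prepared and committed false, which is the all-or-nothing outcome the documentation promises when atomic_foreign_transaction is enabled.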
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..8c1afcf 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -1,16 +1,16 @@
 #
 # Makefile for the rmgr descriptor routines
 #
 # src/backend/access/rmgrdesc/Makefile
 #
 
 subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o gindesc.o \
+	   gistdesc.o hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
 	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..0f0c899
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module describes the WAL records for foreign transaction manager. 
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
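
Reviewer's sketch (not part of the patch): fdw_xact_desc() above prints the foreign transaction identifier with "%.*s" and an explicit length, which suggests the identifier is stored as a counted byte string and not necessarily NUL-terminated. The standalone example below shows that formatting technique with plain snprintf(). The helper name sketch_format_xact_id and the sample identifier are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a length-counted identifier into buf.
 * The "%.*s" precision limits how many bytes of id are read, so id does
 * not need a terminating NUL within the first id_len bytes.
 */
static int
sketch_format_xact_id(char *buf, size_t buflen, const char *id, int id_len)
{
	/* A negative precision would be ignored by printf, so reject it. */
	assert(id_len >= 0);

	return snprintf(buf, buflen, "foreign transaction info: %.*s",
					id_len, id);
}
```

Only the first id_len bytes of the identifier are copied to the output, regardless of what follows them in memory.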
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 4f29136..ad07038 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -104,28 +104,29 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 			if (entry->val == xlrec.wal_level)
 			{
 				wal_level_str = entry->name;
 				break;
 			}
 		}
 
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamps=%s",
+						 "track_commit_timestamps=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
 		memcpy(&fpw, rec, sizeof(bool));
 		appendStringInfoString(buf, fpw ? "true" : "false");
 	}
 	else if (info == XLOG_END_OF_RECOVERY)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 94455b2..51b2efd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -8,16 +8,17 @@
 #
 #-------------------------------------------------------------------------
 
 subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o commit_ts.o multixact.o parallel.o rmgr.o slru.o subtrans.o \
 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o \
+	fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
 # ensure that version checks in xlog.c get recompiled when catversion.h changes
 xlog.o: xlog.c $(top_srcdir)/src/include/catalog/catversion.h
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..9f315d9
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2024 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module manages the transactions involving foreign servers. 
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign server/s.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the oids of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and to the logs to resolve such transactions later.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */ 
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreing_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine 		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u with user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(fdw_conn->serverid); 
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	/*
+	 * We may need the following information at the end of a transaction, when the
+	 * system caches are not available, so save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreing_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Restore the previous memory context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be at most 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */ 
+	Oid				serveroid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */ 
+	bool			fdw_xact_valid;		/* Has the entry been completed and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is the xid, foreign server oid and
+ * user oid, each printed as 8 hexadecimal digits, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serveroid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serveroid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid,
+								Oid userid, int fdw_xact_info_len,
+								char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serveroid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid,
+								bool giveWarning);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */ 
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Transactions are prepared on the foreign servers that can execute the
+ * two-phase-commit protocol. Those are committed or aborted after the local
+ * transaction has been committed or aborted, respectively. On the remaining
+ * foreign servers we try to commit the transactions right away. For those
+ * servers it is possible that some transactions commit even if the local
+ * transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers that cannot execute
+	 * the two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Prepare the transactions on the foreign servers that can execute the
+	 * two-phase-commit protocol.
+	 */
+	prepare_foreign_transactions();
+}
+
+/*
+ * Prepare transactions on the foreign servers that can execute the two-phase
+ * commit protocol. The rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/* 
+	 * Loop over the foreign connections 
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+															fdw_conn->userid,
+															&fdw_xact_info_len);
+		
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * disk and logs it in WAL (thereby relaying it to the standby). Thus,
+		 * if we lose connectivity to the foreign server or crash ourselves, we
+		 * will remember that we have prepared a transaction on the foreign
+		 * server and try to resolve it when connectivity is restored or after
+		 * crash recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 * 
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+											fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info);
+		/*
+		 * Between the register_fdw_xact call above and the time this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it
+		 * will send an ABORT message to the foreign server. If the foreign
+		 * server has not prepared the transaction, the message will succeed. If
+		 * it has prepared the transaction, it will throw an error, which we
+		 * ignore; the prepared foreign transaction will then be resolved by the
+		 * foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from the foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact; 
+	}
+	return;
+}
+
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serveroid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid; 
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serveroid = fdw_xact->serveroid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serveroid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */ 
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m",
+						path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.  Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.  The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */ 
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact 
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id,
+					FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serveroid == serveroid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serveroid %u, userid %u found",
+						xid, serveroid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry into the active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serveroid = serveroid;
+	fdw_xact->userid = userid;
+	fdw_xact->fdw_xact_status = fdw_xact_status; 
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and from
+ * disk, and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */ 
+			fdw_remove_xlog.serveroid = fdw_xact->serveroid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serveroid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reversed the
+	 * order and the process exited in between, we would be left with an entry
+	 * locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/* 
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *    according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *    succeeds, remove them from foreign transaction entries, otherwise unlock
+ *    them.
+ */ 
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/* 
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/*
+	 * Lock and fetch all the entries belonging to the given transaction id.
+	 * If the foreign transaction resolver is running, it might hold locks on
+	 * entries to check whether they can be resolved. The search function skips
+	 * such entries; the resolver will resolve them at a later point in time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ * Look up the FDW routine used to resolve the given prepared foreign
+ * transaction. Raises an error if the FDW does not provide one.
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serveroid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+				fdw->fdwname);
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool 
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serveroid, fdw_xact->userid,
+								is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+	
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. See
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by the
+ * arguments that hold valid values for their respective datatypes.
+ *
+ * The table below spells out the combinations:
+ * xid     | dbid    | serverid | userid  | search for entry with
+ * invalid | invalid | invalid  | invalid | any entry (no criteria)
+ * invalid | invalid | invalid  | valid   | given userid
+ * invalid | invalid | valid    | invalid | given serverid
+ * invalid | invalid | valid    | valid   | given serverid and userid
+ * invalid | valid   | invalid  | invalid | given dbid
+ * invalid | valid   | invalid  | valid   | given dbid and userid
+ * invalid | valid   | valid    | invalid | given dbid and serverid
+ * invalid | valid   | valid    | valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid  | invalid | given xid
+ * valid   | invalid | invalid  | valid   | given xid and userid
+ * valid   | invalid | valid    | invalid | given xid, serverid
+ * valid   | invalid | valid    | valid   | given xid, serverid, userid
+ * valid   | valid   | invalid  | invalid | given xid and dbid 
+ * valid   | valid   | invalid  | valid   | given xid, dbid and userid
+ * valid   | valid   | valid    | invalid | given xid, dbid, serverid
+ * valid   | valid   | valid    | valid   | given xid, dbid, serverid, userid
+ *
+ * When no criteria are given (all arguments invalid), every entry matches,
+ * so the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked, nothing will
+ * be returned in the list but the returned value will still be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+		
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serveroid)
+			entry_matches = false;
+		
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * Returns the OIDs of the databases containing unresolved foreign transactions.
+ * Entries currently locked by a backend are skipped. The function is used by
+ * the pg_fdw_xact_resolver extension. Returns NIL if no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+	
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+		
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	  		*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+		
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serveroid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+	
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m",
+							path)));
+		}
+	
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file: %m")));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+		
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serveroid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Syncs the foreign transaction files created between the two checkpoints.
+ * The foreign transaction entries, and hence the corresponding files, are
+ * expected to be very short-lived. By executing this function at the end, we
+ * might have fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serveroids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serveroids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serveroids[nxacts] = fdw_xact->serveroid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serveroids[cnt], userids[cnt]);
+			
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serveroids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serveroids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xact.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it does not want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+	
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+	
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+	
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+			
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+	
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign transaction;
+				 * nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is
+				 * not running on any backend. It probably crashed before the
+				 * actual abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+	
+	
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+	
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *    foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									PG_GETARG_TRANSACTIONID(XID_ARGNUM);
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+	
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serveroid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("incorrect size of foreign transaction state file \"%s\"",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serveroid != serveroid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* fd was already closed after reading the file; do not close it again */
+		pfree(buf);
+		return NULL;
+	}
+	
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *    conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+	
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serveroid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serveroid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										serveroid, userid,
+										fdw_xact_file_data->fdw_xact_id_len,
+										fdw_xact_file_data->fdw_xact_id,
+										FDW_XACT_PREPARING);
+			/* LSN is unknown after recovery; 0 makes the next checkpoint sync the file */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */	
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+	
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..cdbc583 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -7,20 +7,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/brin_xlog.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 177d1e1..5c9aec7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -34,20 +34,21 @@
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <time.h>
 #include <unistd.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
@@ -1469,20 +1470,26 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 		RelationCacheInitFilePostInvalidate();
 
 	/* And now do the callbacks */
 	if (isCommit)
 		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
 	else
 		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
 	/*
 	 * And now we can clean up our mess.
 	 */
 	RemoveTwoPhaseFile(xid, true);
 
 	RemoveGXact(gxact);
 	MyLockedGxact = NULL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b53d95f..aaa0edc 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -14,20 +14,21 @@
  *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
 #include <time.h>
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -179,20 +180,24 @@ typedef struct TransactionStateData
 	TransactionId *childXids;	/* subcommitted child XIDs, in XID order */
 	int			nChildXids;		/* # of subcommitted child XIDs */
 	int			maxChildXids;	/* allocated size of childXids[] */
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
 	bool		startedInRecovery;		/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction,
+										   only valid for top level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
 
 /*
  * CurrentTransactionState always points to the current transaction state
  * block.  It will point to TopTransactionStateData when not in a
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
@@ -1884,20 +1889,23 @@ StartTransaction(void)
 	/* SecurityRestrictionContext should never be set outside a transaction */
 	Assert(s->prevSecContext == 0);
 
 	/*
 	 * initialize other subsystems for new transaction
 	 */
 	AtStart_GUC();
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
 	 */
 	s->state = TRANS_INPROGRESS;
 
 	ShowTransactionState("StartTransaction");
 }
 
 
@@ -1940,20 +1948,23 @@ CommitTransaction(void)
 
 		/*
 		 * Close open portals (converting holdable ones into static portals).
 		 * If there weren't any, we are done ... otherwise loop back to check
 		 * if they queued deferred triggers.  Lather, rinse, repeat.
 		 */
 		if (!PreCommit_Portals(false))
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
 	/*
 	 * The remaining actions cannot call any user-defined code, so it's safe
 	 * to start shutting down within-transaction services.  But note that most
 	 * of this stuff could still throw an error, which would switch us into
 	 * the transaction-abort path.
 	 */
 
@@ -2099,20 +2110,21 @@ CommitTransaction(void)
 	AtEOXact_GUC(true, 1);
 	AtEOXact_SPI(true);
 	AtEOXact_on_commit_actions(true);
 	AtEOXact_Namespace(true, is_parallel_worker);
 	AtEOXact_SMgr();
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
 	ResourceOwnerDelete(TopTransactionResourceOwner);
 	s->curTransactionOwner = NULL;
 	CurTransactionResourceOwner = NULL;
 	TopTransactionResourceOwner = NULL;
 
 	AtCommit_Memory();
 
@@ -2283,20 +2295,21 @@ PrepareTransaction(void)
 	 * before or after releasing the transaction's locks.
 	 */
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
 	AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
 	 *
 	 * We have to record transaction prepares even if we didn't make any
 	 * updates, because the transaction manager might get confused if we lose
 	 * a global transaction.
 	 */
 	EndPrepare(gxact);
 
@@ -2565,20 +2578,21 @@ AbortTransaction(void)
 
 		AtEOXact_GUC(false, 1);
 		AtEOXact_SPI(false);
 		AtEOXact_on_commit_actions(false);
 		AtEOXact_Namespace(false, is_parallel_worker);
 		AtEOXact_SMgr();
 		AtEOXact_Files();
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
 	RESUME_INTERRUPTS();
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 68e33eb..68bc52d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -16,20 +16,21 @@
 
 #include <ctype.h>
 #include <time.h>
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -4872,20 +4873,21 @@ BootStrapXLOG(void)
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
 	WriteControlFile();
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5877,20 +5879,23 @@ CheckRequiredParameterValues(void)
 									 ControlFile->max_worker_processes);
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
 		RecoveryRequiresBoolParameter("track_commit_timestamp",
 									  track_commit_timestamp,
 									  ControlFile->track_commit_timestamp);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
 StartupXLOG(void)
 {
 	XLogCtlInsert *Insert;
@@ -6561,21 +6569,24 @@ StartupXLOG(void)
 		{
 			TransactionId *xids;
 			int			nxids;
 
 			ereport(DEBUG1,
 					(errmsg("initializing for hot standby")));
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
 
 			/* Tell procarray about the range of xids it has to deal with */
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
 			 * Startup commit log, commit timestamp and subtrans only.
 			 * MultiXact has already been started up and other SLRUs are not
@@ -7161,20 +7172,21 @@ StartupXLOG(void)
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
 	XLogCtl->LogwrtResult = LogwrtResult;
 
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7358,20 +7370,26 @@ StartupXLOG(void)
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
 	TrimCLOG();
 	TrimMultiXact();
 
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
 	if (readFile >= 0)
 	{
 		close(readFile);
@@ -8632,20 +8650,25 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointMultiXact();
 	CheckPointPredicate();
 	CheckPointRelationMap();
 	CheckPointReplicationSlots();
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
  * Save a checkpoint for recovery restart if appropriate
  *
  * This function is called each time a checkpoint record is read from XLOG.
  * It must determine whether the checkpoint represents a safe restartpoint or
  * not.  If so, the checkpoint record is stashed in shared memory so that
  * CreateRestartPoint can consult it.  (Note that the latter function is
  * executed by the checkpointer, while this one will be executed by the
@@ -9057,56 +9080,59 @@ XLogRestorePoint(const char *rpName)
  */
 static void
 XLogReportParameters(void)
 {
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
 		 * if archiving is not enabled, as you can't start archive recovery
 		 * with wal_level=minimal anyway. We don't really care about the
 		 * values in pg_control either if wal_level=minimal, but seems better
 		 * to keep them up-to-date to avoid confusion.
 		 */
 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
 		{
 			xl_parameter_change xlrec;
 			XLogRecPtr	recptr;
 
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
 			recptr = XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE);
 			XLogFlush(recptr);
 		}
 
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
 
 /*
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
  * Note: this function assumes there is no other process running
  * concurrently that could update it.
@@ -9280,20 +9306,21 @@ xlog_redo(XLogReaderState *record)
 		 */
 		if (standbyState >= STANDBY_INITIALIZED)
 		{
 			TransactionId *xids;
 			int			nxids;
 			TransactionId oldestActiveXID;
 			TransactionId latestCompletedXid;
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
 			 * down server, with only prepared transactions still alive. We're
 			 * never overflowed at this point because all subxids are listed
 			 * with their parent prepared transactions.
 			 */
 			running.xcnt = nxids;
 			running.subxcnt = 0;
 			running.subxid_overflow = false;
@@ -9469,20 +9496,21 @@ xlog_redo(XLogReaderState *record)
 		/* Update our copy of the parameters in pg_control */
 		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
 
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
 		 * recover back up to this point before allowing hot standby again.
 		 * This is particularly important if wal_level was set to 'archive'
 		 * before, and is now 'hot_standby', to ensure you don't run queries
 		 * against the WAL preceding the wal_level change. Same applies to
 		 * decreasing max_* settings.
 		 */
 		minRecoveryPoint = ControlFile->minRecoveryPoint;
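
Note on the PrescanFDWXacts() calls in the xlog.c hunks above: each call passes in the oldestActiveXID obtained from PrescanPreparedTransactions() and stores the result back, which suggests the function returns the older of its argument and any local XID that still has a prepared foreign transaction. The function body is not part of this section, so the following is only a minimal standalone sketch of that idea. The names Xid, xid_precedes and prescan_oldest_xid are hypothetical stand-ins, and the comparison is modeled on the wraparound-aware logic of TransactionIdPrecedes().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;					/* stand-in for TransactionId */
#define FIRST_NORMAL_XID ((Xid) 3)		/* mirrors FirstNormalTransactionId */

/* Wraparound-aware "a is older than b", modeled on TransactionIdPrecedes(). */
static bool
xid_precedes(Xid a, Xid b)
{
	/* Special (non-normal) XIDs compare as plain unsigned integers. */
	if (a < FIRST_NORMAL_XID || b < FIRST_NORMAL_XID)
		return a < b;

	/* Normal XIDs compare modulo 2^32. */
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical prescan: return the older of 'oldest' and every local XID
 * that still has a prepared foreign transaction attached to it.
 */
static Xid
prescan_oldest_xid(Xid oldest, const Xid *fdw_xids, size_t n)
{
	assert(fdw_xids != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (xid_precedes(fdw_xids[i], oldest))
			oldest = fdw_xids[i];
	}
	return oldest;
}
```

With this shape, chaining the two prescans as the patch does keeps a single running minimum, so a plain unsigned comparison would pick the wrong XID once the counter has wrapped around.
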
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 95d6c14..3100f50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -11,20 +11,21 @@
  *	  src/backend/bootstrap/bootstrap.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <unistd.h>
 #include <signal.h>
 
 #include "access/htup_details.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..4691e66 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -236,20 +236,29 @@ CREATE VIEW pg_available_extension_versions AS
            LEFT JOIN pg_extension AS X
              ON E.name = X.extname AND E.version = X.extversion;
 
 CREATE VIEW pg_prepared_xacts AS
     SELECT P.transaction, P.gid, P.prepared,
            U.rolname AS owner, D.datname AS database
     FROM pg_prepared_xact() AS P
          LEFT JOIN pg_authid U ON P.ownerid = U.oid
          LEFT JOIN pg_database D ON P.dbid = D.oid;
 
+CREATE VIEW pg_fdw_xacts AS
+	SELECT P.transaction, D.datname AS database, S.srvname AS "foreign server",
+			U.rolname AS "local user", P.status,
+			P.identifier AS "foreign transaction identifier"
+	FROM pg_fdw_xact() AS P
+		LEFT JOIN pg_authid U ON P.userid = U.oid
+		LEFT JOIN pg_database D ON P.dbid = D.oid
+		LEFT JOIN pg_foreign_server S ON P.serverid = S.oid;
+
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
 	CASE WHEN rel.relkind = 'r' THEN 'table'::text
 		 WHEN rel.relkind = 'v' THEN 'view'::text
 		 WHEN rel.relkind = 'm' THEN 'materialized view'::text
 		 WHEN rel.relkind = 'S' THEN 'sequence'::text
@@ -933,10 +942,18 @@ LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'make_interval';
 
 CREATE OR REPLACE FUNCTION
   jsonb_set(jsonb_in jsonb, path text[] , replacement jsonb,
             create_if_missing boolean DEFAULT true)
 RETURNS jsonb
 LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'jsonb_set';
+
+CREATE OR REPLACE FUNCTION
+  pg_fdw_remove(transaction xid DEFAULT NULL, dbid oid DEFAULT NULL,
+				serverid oid DEFAULT NULL, userid oid DEFAULT NULL)
+RETURNS void
+LANGUAGE INTERNAL
+VOLATILE
+AS 'pg_fdw_remove';
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index cc912b2..3408252 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -6,20 +6,21 @@
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  *
  *
  * IDENTIFICATION
  *	  src/backend/commands/foreigncmds.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
@@ -1080,20 +1081,34 @@ RemoveForeignServerById(Oid srvId)
 	HeapTuple	tp;
 	Relation	rel;
 
 	rel = heap_open(ForeignServerRelationId, RowExclusiveLock);
 
 	tp = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(srvId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check whether any foreign transaction is still prepared on this foreign
+	 * server. If one is and the server gets dropped, we would have no way left
+	 * to resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
 
 	heap_close(rel, RowExclusiveLock);
 }
 
 
 /*
  * Common routine to check permission for user-mapping-related DDL
@@ -1252,20 +1267,21 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
 
 	umId = GetSysCacheOid2(USERMAPPINGUSERSERVER,
 						   ObjectIdGetDatum(useId),
 						   ObjectIdGetDatum(srv->serverid));
 	if (!OidIsValid(umId))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("user mapping \"%s\" does not exist for the server",
 						MappingUserName(useId))));
 
+
 	user_mapping_ddl_aclcheck(useId, srv->serverid, stmt->servername);
 
 	tp = SearchSysCacheCopy1(USERMAPPINGOID, ObjectIdGetDatum(umId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for user mapping %u", umId);
 
 	memset(repl_val, 0, sizeof(repl_val));
 	memset(repl_null, false, sizeof(repl_null));
 	memset(repl_repl, false, sizeof(repl_repl));
@@ -1378,20 +1394,31 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 		/* IF EXISTS specified, just note it */
 		ereport(NOTICE,
 		(errmsg("user mapping \"%s\" does not exist for the server, skipping",
 				MappingUserName(useId))));
 		return InvalidOid;
 	}
 
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
 	object.objectId = umId;
 	object.objectSubId = 0;
 
 	performDeletion(&object, DROP_CASCADE, 0);
 
 	return umId;
 }
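
Note on the two fdw_xact_exists() calls in the foreigncmds.c hunks above: the server check passes InvalidTransactionId and InvalidOid for the XID and user, while the user-mapping check fills in the user as well. That reads as if invalid values act as wildcards. The real function is defined elsewhere in the patch, so the sketch below only illustrates that assumed matching rule over a plain array. FdwXactEntry and entry_exists are hypothetical names, not the patch's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t Xid;
#define INVALID_OID ((Oid) 0)	/* mirrors InvalidOid */
#define INVALID_XID ((Xid) 0)	/* mirrors InvalidTransactionId */

/* Hypothetical in-memory entry for one prepared foreign transaction. */
typedef struct
{
	Xid			local_xid;
	Oid			dboid;
	Oid			serveroid;
	Oid			userid;
} FdwXactEntry;

/*
 * Return true if any entry matches all the given keys.  An invalid key
 * (assumption: treated as a wildcard) matches every entry.
 */
static bool
entry_exists(const FdwXactEntry *entries, size_t n,
			 Xid xid, Oid dboid, Oid serveroid, Oid userid)
{
	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		const FdwXactEntry *e = &entries[i];

		if (xid != INVALID_XID && e->local_xid != xid)
			continue;
		if (dboid != INVALID_OID && e->dboid != dboid)
			continue;
		if (serveroid != INVALID_OID && e->serveroid != serveroid)
			continue;
		if (userid != INVALID_OID && e->userid != userid)
			continue;
		return true;
	}
	return false;
}
```

Under that assumption, DROP SERVER asks "is anything prepared on this server in this database, for any user and any local transaction", and DROP USER MAPPING narrows the same question to one user.
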
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 000524d..37b7af6 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -86,20 +86,21 @@
 #ifdef USE_BONJOUR
 #include <dns_sd.h>
 #endif
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
 #include <pthread.h>
 #endif
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "lib/ilist.h"
 #include "libpq/auth.h"
 #include "libpq/ip.h"
 #include "libpq/libpq.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
@@ -2494,21 +2495,20 @@ pmdie(SIGNAL_ARGS)
 				SignalUnconnectedWorkers(SIGTERM);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
 				/* and the bgwriter too */
 				if (BgWriterPID != 0)
 					signal_child(BgWriterPID, SIGTERM);
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
-
 				/*
 				 * If we're in recovery, we can't kill the startup process
 				 * right away, because at present doing so does not release
 				 * its locks.  We might want to change this in a future
 				 * release.  For the time being, the PM_WAIT_READONLY state
 				 * indicates that we're waiting for the regular (read only)
 				 * backends to die off; once they do, we'll kill the startup
 				 * and walreceiver processes.
 				 */
 				pmState = (pmState == PM_RUN) ?
@@ -5736,20 +5736,21 @@ PostmasterMarkPIDForWorkerNotify(int pid)
 
 	dlist_foreach(iter, &BackendList)
 	{
 		bp = dlist_container(Backend, elem, iter.cur);
 		if (bp->pid == pid)
 		{
 			bp->bgworker_notify = true;
 			return true;
 		}
 	}
+
 	return false;
 }
 
 #ifdef EXEC_BACKEND
 
 /*
  * The following need to be available to the save/restore_backend_variables
  * functions.  They are marked NON_EXEC_STATIC in their home modules.
  */
 extern slock_t *ShmemLock;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c629da3..6fdd818 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -127,20 +127,21 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_MULTIXACT_ID:
 		case RM_RELMAP_ID:
 		case RM_BTREE_ID:
 		case RM_HASH_ID:
 		case RM_GIN_ID:
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
 	}
 }
 
 /*
  * Handle rmgr XLOG_ID records for DecodeRecordIntoReorderBuffer().
  */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a790e5b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -14,20 +14,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/fdw_xact.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
@@ -132,20 +133,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcSignalShmemSize());
 		size = add_size(size, CheckpointerShmemSize());
 		size = add_size(size, AutoVacuumShmemSize());
 		size = add_size(size, ReplicationSlotsShmemSize());
 		size = add_size(size, ReplicationOriginShmemSize());
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
 
 		/* freeze the addin request size and include it */
 		addin_request_allowed = false;
 		size = add_size(size, total_addin_request);
 
 		/* might as well round it off to a multiple of a typical page size */
 		size = add_size(size, 8192 - (size % 8192));
@@ -243,20 +245,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
 	/*
 	 * Alloc the win32 shared backend array
 	 */
 	if (!IsUnderPostmaster)
 		ShmemBackendArrayAllocation();
 #endif
 
diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c
index 6b61765..d6cba87 100644
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@@ -15,21 +15,20 @@
 #include "postgres.h"
 
 #include <limits.h>
 
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/xact.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 
-#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 #define PG_RETURN_TRANSACTIONID(x)	return TransactionIdGetDatum(x)
 
 #define PG_GETARG_COMMANDID(n)		DatumGetCommandId(PG_GETARG_DATUM(n))
 #define PG_RETURN_COMMANDID(x)		return CommandIdGetDatum(x)
 
 
 Datum
 xidin(PG_FUNCTION_ARGS)
 {
 	char	   *str = PG_GETARG_CSTRING(0);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..5037483 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -20,20 +20,21 @@
 #include <float.h>
 #include <math.h>
 #include <limits.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/vacuum.h"
 #include "commands/variable.h"
 #include "commands/trigger.h"
@@ -1999,20 +2000,33 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"max_prepared_transactions", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the maximum number of simultaneously prepared transactions."),
 			NULL
 		},
 		&max_prepared_xacts,
 		0, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("Sets the minimum OID of tables for tracking locks."),
 			gettext_noop("Is used to avoid output on system tables."),
 			GUC_NOT_IN_SAMPLE
 		},
 		&Trace_lock_oidmin,
 		FirstNormalObjectId, 0, INT_MAX,
 		NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e5d275d..2107f95 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -116,20 +116,26 @@
 					# (change requires restart)
 #huge_pages = try			# on, off, or try
 					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
 # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
 # per transaction slot, plus lock space (see max_locks_per_transaction).
 # It is not advisable to set max_prepared_transactions nonzero unless you
 # actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0		# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
 					# supported by the operating system:
 					#   posix
 					#   sysv
 					#   windows
 					#   mmap
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index feeff9e..47ecf1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,31 +192,32 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
 	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
 };
 
 
 /* path to 'initdb' binary directory */
 static char bin_path[MAXPGPATH];
 static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			  const char *token, const char *replacement);
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d8cfe5e..00aad71 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -324,12 +324,14 @@ main(int argc, char *argv[])
 	printf(_("Size of a large-object chunk:         %u\n"),
 		   ControlFile.loblksize);
 	printf(_("Date/time type storage:               %s\n"),
 		   (ControlFile.enableIntTimes ? _("64-bit integers") : _("floating-point numbers")));
 	printf(_("Float4 argument passing:              %s\n"),
 		   (ControlFile.float4ByVal ? _("by value") : _("by reference")));
 	printf(_("Float8 argument passing:              %s\n"),
 		   (ControlFile.float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile.data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile.max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 72755b0..e28328e 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -579,20 +579,21 @@ GuessControlValues(void)
 	ControlFile.unloggedLSN = 1;
 
 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
 	ControlFile.relseg_size = RELSEG_SIZE;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
 	ControlFile.indexMaxKeys = INDEX_MAX_KEYS;
@@ -795,20 +796,21 @@ RewriteControlFile(void)
 	 * Force the defaults for max_* settings. The values don't really matter
 	 * as long as wal_level='minimal'; the postmaster will reset these fields
 	 * anyway at startup.
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
 	ControlFile.xlog_seg_size = XLogSegSize;
 
 	/* Contents are protected with a CRC */
 	INIT_CRC32C(ControlFile.crc);
 	COMP_CRC32C(ControlFile.crc,
 				(char *) &ControlFile,
 				offsetof(ControlFileData, crc));
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 5b88a8d..82c6b51 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -14,20 +14,21 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/fdw_xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..d22cc47
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,74 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+/* #include "foreign/fdwapi.h" */
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serveroid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serveroid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serveroid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
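
Note on FDWXactOnDiskData in the header above: the struct ends in a flexible array member, so a record's size depends on the length of the foreign transaction identifier, and sizeof() alone is not enough. The sketch below shows the usual way to size and fill such a record in standalone C. OnDiskRec, on_disk_rec_size and make_on_disk_rec are hypothetical names that mirror the struct's layout, and malloc() stands in for whatever allocator the patch really uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t Oid;
typedef uint32_t Xid;

/* Same field order as FDWXactOnDiskData; names here are illustrative. */
typedef struct
{
	Oid			dboid;
	Xid			local_xid;
	Oid			serveroid;
	Oid			userid;
	uint32_t	fdw_xact_id_len;	/* number of bytes in fdw_xact_id */
	char		fdw_xact_id[];		/* flexible array member, must be last */
} OnDiskRec;

/* Bytes needed for a record whose identifier is id_len bytes long. */
static size_t
on_disk_rec_size(uint32_t id_len)
{
	return offsetof(OnDiskRec, fdw_xact_id) + id_len;
}

/* Allocate and fill one record; the caller releases it with free(). */
static OnDiskRec *
make_on_disk_rec(Oid dboid, Xid xid, Oid serveroid, Oid userid,
				 const char *id, uint32_t id_len)
{
	OnDiskRec  *rec;

	assert(id != NULL || id_len == 0);

	rec = malloc(on_disk_rec_size(id_len));
	if (rec == NULL)
		return NULL;

	rec->dboid = dboid;
	rec->local_xid = xid;
	rec->serveroid = serveroid;
	rec->userid = userid;
	rec->fdw_xact_id_len = id_len;
	if (id_len > 0)
		memcpy(rec->fdw_xact_id, id, id_len);
	return rec;
}
```

Storing the length explicitly, as the header does with fdw_xact_id_len, means the identifier does not have to be NUL-terminated and a reader can size its buffer from the fixed-size prefix before reading the rest.
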
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c083216..7272c33 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -37,11 +37,12 @@ PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify,
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
 PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index cb1c2db..d614ab6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -296,20 +296,21 @@ typedef struct xl_xact_parsed_abort
 	RelFileNode *xnodes;
 
 	TransactionId twophase_xid; /* only for 2PC */
 } xl_xact_parsed_abort;
 
 
 /* ----------------
  *		extern definitions
  * ----------------
  */
+#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 extern bool IsTransactionState(void);
 extern bool IsAbortedTransactionBlockState(void);
 extern TransactionId GetTopTransactionId(void);
 extern TransactionId GetTopTransactionIdIfAny(void);
 extern TransactionId GetCurrentTransactionId(void);
 extern TransactionId GetCurrentTransactionIdIfAny(void);
 extern TransactionId GetStableLatestTransactionId(void);
 extern SubTransactionId GetCurrentSubTransactionId(void);
 extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 5ebaa5f..c4d80e6 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -206,20 +206,21 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 
 /*
  * Information logged when we detect a change in one of the parameters
  * important for Hot Standby.
  */
 typedef struct xl_parameter_change
 {
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
 	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
 typedef struct xl_restore_point
 {
 	TimestampTz rp_time;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..d168c32 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -173,20 +173,21 @@ typedef struct ControlFileData
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
 	 * or hot standby.
 	 */
 	int			wal_level;
 	bool		wal_log_hints;
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
 	 * explicitly, since the pg_control version will surely look wrong to a
 	 * machine of different endianness, but we do need to worry about MAXALIGN
 	 * and floating-point format.  (Note: storage layout nominally also
 	 * depends on SHORTALIGN and INTALIGN, but in practice these are the same
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ddf7c67..019aa7a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5266,20 +5266,26 @@ DESCR("fractional rank of hypothetical row");
 DATA(insert OID = 3989 ( percent_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_percent_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3990 ( cume_dist			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 701 "2276" "{2276}" "{v}" _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("cumulative distribution of hypothetical row");
 DATA(insert OID = 3991 ( cume_dist_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_cume_dist_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i 1 0 20 "2276" "{2276}" "{v}" _null_ _null_ _null_	aggregate_dummy _null_ _null_ _null_ ));
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_	hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4066 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4083 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4099 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3584 ( binary_upgrade_set_next_array_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_array_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3585 ( binary_upgrade_set_next_toast_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_toast_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3586 ( binary_upgrade_set_next_heap_pg_class_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_heap_pg_class_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..d1ddb4e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -5,20 +5,21 @@
  *
  * Copyright (c) 2010-2015, PostgreSQL Global Development Group
  *
  * src/include/foreign/fdwapi.h
  *
  *-------------------------------------------------------------------------
  */
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/xact.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
 /* To avoid including explain.h here, reference ExplainState thus: */
 struct ExplainState;
 
 
 /*
  * Callback function signatures --- see fdwhandler.sgml for more info.
  */
@@ -110,20 +111,32 @@ typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 											   HeapTuple *rows, int targrows,
 												  double *totalrows,
 												  double *totaldeadrows);
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 												 AcquireSampleRowsFunc *func,
 													BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
+typedef bool (*EndForeignTransaction_function) (Oid serverOid, Oid userid,
+													bool is_commit);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverOid, Oid userid,
+														int prep_info_len,
+														char *prep_info);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverOid, Oid userid,
+															bool is_commit,
+														int prep_info_len,
+														char *prep_info);
+typedef char *(*GetPrepareId_function) (Oid serverOid, Oid userid,
+														int *prep_info_len);
+
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
  * planner and executor.
  *
  * More function pointers are likely to be added in the future.  Therefore
  * it's recommended that the handler initialize the struct with
  * makeNode(FdwRoutine) so that all fields are set to NULL.  This will
  * ensure that no fields are accidentally left undefined.
@@ -165,20 +178,26 @@ typedef struct FdwRoutine
 
 	/* Support functions for EXPLAIN */
 	ExplainForeignScan_function ExplainForeignScan;
 	ExplainForeignModify_function ExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	AnalyzeForeignTable_function AnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
+
+	/* Support functions for foreign transactions */
+	GetPrepareId_function				GetPrepareId;
+	EndForeignTransaction_function		EndForeignTransaction;
+	PrepareForeignTransaction_function	PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
 } FdwRoutine;
 
 
 /* Functions in foreign/foreign.c */
 extern FdwRoutine *GetFdwRoutine(Oid fdwhandler);
 extern Oid	GetForeignServerIdByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineByServerId(Oid serverid);
 extern FdwRoutine *GetFdwRoutineByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineForRelation(Relation relation, bool makecopy);
 extern bool IsImportableForeignTable(const char *tablename,
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index f2ff6a0..c05e281 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -132,22 +132,23 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
 #define OldSerXidLock				(&MainLWLockArray[31].lock)
 #define SyncRepLock					(&MainLWLockArray[32].lock)
 #define BackgroundWorkerLock		(&MainLWLockArray[33].lock)
 #define DynamicSharedMemoryControlLock		(&MainLWLockArray[34].lock)
 #define AutoFileLock				(&MainLWLockArray[35].lock)
 #define ReplicationSlotAllocationLock	(&MainLWLockArray[36].lock)
 #define ReplicationSlotControlLock		(&MainLWLockArray[37].lock)
 #define CommitTsControlLock			(&MainLWLockArray[38].lock)
 #define CommitTsLock				(&MainLWLockArray[39].lock)
 #define ReplicationOriginLock		(&MainLWLockArray[40].lock)
+#define FDWXactLock					(&MainLWLockArray[41].lock)
 
-#define NUM_INDIVIDUAL_LWLOCKS		41
+#define NUM_INDIVIDUAL_LWLOCKS		42
 
 /*
  * It's a bit odd to declare NUM_BUFFER_PARTITIONS and NUM_LOCK_PARTITIONS
  * here, but we need them to figure out offsets within MainLWLockArray, and
  * having this file include lock.h or bufmgr.h would be backwards.
  */
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
 
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 421bb58..47dcf13 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -222,25 +222,26 @@ typedef struct PROC_HDR
 } PROC_HDR;
 
 extern PROC_HDR *ProcGlobal;
 
 extern PGPROC *PreparedXactProcs;
 
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
 extern int	DeadlockTimeout;
 extern int	StatementTimeout;
 extern int	LockTimeout;
 extern bool log_lock_waits;
 
 
 /*
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index fc1679e..d31ceb0 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1257,11 +1257,15 @@ extern Datum pg_available_extensions(PG_FUNCTION_ARGS);
 extern Datum pg_available_extension_versions(PG_FUNCTION_ARGS);
 extern Datum pg_extension_update_paths(PG_FUNCTION_ARGS);
 extern Datum pg_extension_config_dump(PG_FUNCTION_ARGS);
 
 /* commands/prepare.c */
 extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 44c6740..e71ffed 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1305,20 +1305,30 @@ pg_available_extensions| SELECT e.name,
     e.comment
    FROM (pg_available_extensions() e(name, default_version, comment)
      LEFT JOIN pg_extension x ON ((e.name = x.extname)));
 pg_cursors| SELECT c.name,
     c.statement,
     c.is_holdable,
     c.is_binary,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT p.transaction,
+    d.datname AS database,
+    s.srvname AS "foreign server",
+    u.rolname AS "local user",
+    p.status,
+    p.identifier AS "foreign transaction identifier"
+   FROM (((pg_fdw_xact() p(dbid, transaction, serverid, userid, status, identifier)
+     LEFT JOIN pg_authid u ON ((p.userid = u.oid)))
+     LEFT JOIN pg_database d ON ((p.dbid = d.oid)))
+     LEFT JOIN pg_foreign_server s ON ((p.serverid = s.oid)));
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
     a.name,
     a.setting,
     a.applied,
     a.error
    FROM pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting, applied, error);
 pg_group| SELECT pg_authid.rolname AS groname,
     pg_authid.oid AS grosysid,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index dd65ab5..3c23446 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2224,37 +2224,40 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		if (system(buf))
 		{
 			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
 			exit(2);
 		}
 
 		/*
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
 		if (pg_conf == NULL)
 		{
 			fprintf(stderr, _("\n%s: could not open \"%s\" for adding extra config: %s\n"), progname, buf, strerror(errno));
 			exit(2);
 		}
 		fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
 		fputs("log_autovacuum_min_duration = 0\n", pg_conf);
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		if (temp_config != NULL)
 		{
 			FILE	   *extra_conf;
 			char		line_buf[1024];
 
 			extra_conf = fopen(temp_config, "r");
 			if (extra_conf == NULL)
 			{
 				fprintf(stderr, _("\n%s: could not open \"%s\" to read extra config: %s\n"), progname, temp_config, strerror(errno));

#43

Robert Haas

robertmhaas@gmail.com

about 10 years ago

In reply to: Ashutosh Bapat (#42)

Re: Transactions involving multiple postgres foreign servers

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's updated
patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

The recent eXtensible Transaction Manager and the slides shared at the
Vienna sharding summit, now posted at
https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
me think that some careful thought is needed here about what we want
and how it should work. Slide 10 proposes a method for the extensible
transaction manager API to interact with FDWs. The FDW would do this:

select dtm_join_transaction(xid);
begin transaction;
update...;
commit;

I think the idea here is that the commit command doesn't really
commit; it just escapes the distributed transaction while leaving it
marked not-committed. When the transaction subsequently commits on
the local server, the XID is marked committed and the effects of the
transaction become visible on all nodes.

I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.

On the other hand, I see a couple of problems:

1. The extensible transaction manager API is meant to be pluggable.
Depending on which XTM module you choose to load, the SQL that needs
to be executed by postgres_fdw on the remote node will vary.
postgres_fdw shouldn't have knowledge of all the possible XTMs out
there, so it would need some way to know what SQL to send.

2. If the remote server isn't running the same XTM as the local
server, or if it is running the same XTM but is not part of the same
group of cooperating nodes as the local server, then we can't send a
command to join the distributed transaction at all. In that case, the
2PC for FDW approach is still, maybe, useful.
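
The fallback described in point 2 can be sketched roughly as below. This is only an illustration, not code from the patch or from any XTM module: ForeignServerCaps, CommitProtocol and choose_commit_protocol() are hypothetical names, and the can_prepare field is assumed to play the same role as the can_prepare flag passed to RegisterXactForeignServer() in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of what a server supports (not a PostgreSQL struct). */
typedef struct ForeignServerCaps
{
	const char *xtm_name;		/* XTM module in use, or NULL if none */
	const char *xtm_group;		/* group of cooperating nodes, or NULL */
	bool		can_prepare;	/* server supports PREPARE TRANSACTION */
} ForeignServerCaps;

/* Hypothetical set of ways to finish the remote part of a transaction. */
typedef enum CommitProtocol
{
	PROTO_XTM_JOIN,				/* join the distributed transaction */
	PROTO_TWO_PHASE,			/* 2PC for FDW: prepare, then commit/abort */
	PROTO_ONE_PHASE				/* plain COMMIT, no atomicity guarantee */
} CommitProtocol;

/* True only when both strings are present and equal. */
static bool
same_str(const char *a, const char *b)
{
	return a != NULL && b != NULL && strcmp(a, b) == 0;
}

/*
 * Prefer the XTM join when the remote server runs the same XTM in the same
 * group as the local server; otherwise fall back to 2PC if the remote can
 * prepare, and to a one-phase commit if it cannot.
 */
CommitProtocol
choose_commit_protocol(const ForeignServerCaps *local,
					   const ForeignServerCaps *remote)
{
	assert(local != NULL && remote != NULL);

	if (same_str(local->xtm_name, remote->xtm_name) &&
		same_str(local->xtm_group, remote->xtm_group))
		return PROTO_XTM_JOIN;
	if (remote->can_prepare)
		return PROTO_TWO_PHASE;
	return PROTO_ONE_PHASE;
}
```

Under these assumptions, a remote server in a different group of cooperating nodes would get PROTO_TWO_PHASE even if it loads the same XTM module, which is the case where the 2PC for FDW approach remains useful.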

On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#44

Amit Kapila

amit.kapila16@gmail.com

about 10 years ago

In reply to: Robert Haas (#43)

Re: Transactions involving multiple postgres foreign servers

On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's updated
patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

The recent eXtensible Transaction Manager and the slides shared at the
Vienna sharding summit, now posted at
https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
me think that some careful thought is needed here about what we want
and how it should work. Slide 10 proposes a method for the extensible
transaction manager API to interact with FDWs. The FDW would do this:

select dtm_join_transaction(xid);
begin transaction;
update...;
commit;

I think the idea here is that the commit command doesn't really
commit; it just escapes the distributed transaction while leaving it
marked not-committed. When the transaction subsequently commits on
the local server, the XID is marked committed and the effects of the
transaction become visible on all nodes.

As per my reading of the slides you shared, the commit in the above
context would send a message to the Arbiter indicating the node's vote
that it is ready to commit. When the Arbiter has received votes from all
nodes participating in the transaction, it sends back an ok message
(this is what I could understand from slides 12 and 13). I think each
node will mark the transaction as committed on receiving the ok message.
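
A rough sketch of that vote counting, as I read it, could look like the following. It is an assumption-laden illustration and not the Arbiter's actual code: ArbiterTxn, arbiter_begin() and arbiter_record_vote() are hypothetical names, nodes are identified by small integers, and message passing is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 16			/* arbitrary limit for this sketch */

/* Hypothetical per-transaction state kept by the arbiter. */
typedef struct ArbiterTxn
{
	int			nparticipants;	/* nodes taking part in the transaction */
	int			nvotes;			/* distinct ready-to-commit votes so far */
	bool		voted[MAX_NODES];
} ArbiterTxn;

/* Start tracking a transaction with the given number of participants. */
void
arbiter_begin(ArbiterTxn *txn, int nparticipants)
{
	assert(nparticipants > 0 && nparticipants <= MAX_NODES);
	txn->nparticipants = nparticipants;
	txn->nvotes = 0;
	for (int i = 0; i < MAX_NODES; i++)
		txn->voted[i] = false;
}

/*
 * Record a ready-to-commit vote from one node.  A repeated vote from the
 * same node is counted only once.  Returns true once every participant has
 * voted, i.e. when the arbiter could send the ok message back to all nodes.
 */
bool
arbiter_record_vote(ArbiterTxn *txn, int node)
{
	assert(node >= 0 && node < txn->nparticipants);
	if (!txn->voted[node])
	{
		txn->voted[node] = true;
		txn->nvotes++;
	}
	return txn->nvotes == txn->nparticipants;
}
```

Every commit in the system passes through a structure like this on a single process, which is why the Arbiter is a candidate point of contention.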

I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.

On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

If I understood correctly, then the main difference between the 2PC idea
used in this patch (assuming we find some way of sharing snapshots
in this approach) and what is shared in the slides is that the XTM-based
approach relies on an external entity, which it refers to as the Arbiter,
for performing consistent transaction commit/abort and sharing of snapshots
across all the nodes, whereas in the approach in this patch the transaction
originator (which we can call the coordinator) is responsible for consistent
transaction commit/abort. I think the plus point of the XTM-based approach is
that it provides a way of sharing snapshots, but we still need to evaluate
the communication overhead of the two methods. As far as I can see, in the
Arbiter-based approach the Arbiter could become a single point of
contention for coordinating messages for all the transactions in a system,
whereas if we extend the approach in this patch such contention could be
avoided. It is quite possible that the number of messages exchanged between
nodes in the Arbiter-based approach is smaller, but contention could still
play a major role.

Another important point which needs more thought before settling on either
approach is detection of deadlocks between different nodes. The slides you
shared do not discuss deadlocks, so it is not clear whether the approach
will work as is, or whether a deadlock detection system is needed and, if
so, how it would be achieved.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#45

Amit Kapila

amit.kapila16@gmail.com

about 10 years ago

In reply to: Amit Kapila (#44)

Re: Transactions involving multiple postgres foreign servers

On Sat, Nov 7, 2015 at 12:52 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

If I understood correctly, then the main difference between 2PC idea
used in this patch (considering we find some way of sharing snapshots
in this approach) and what is shared in slides is that XTM-based
approach

Read it as DTM-based approach.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#46

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 10 years ago

In reply to: Robert Haas (#43)

Re: Transactions involving multiple postgres foreign servers

On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's updated
patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

The recent eXtensible Transaction Manager and the slides shared at the
Vienna sharding summit, now posted at
https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
me think that some careful thought is needed here about what we want
and how it should work. Slide 10 proposes a method for the extensible
transaction manager API to interact with FDWs. The FDW would do this:

select dtm_join_transaction(xid);
begin transaction;
update...;
commit;

I think the idea here is that the commit command doesn't really
commit; it just escapes the distributed transaction while leaving it
marked not-committed. When the transaction subsequently commits on
the local server, the XID is marked committed and the effects of the
transaction become visible on all nodes.

Since the foreign server (referred to in the slides as the secondary server)
is required to run "create extension pg_dtm" and select
dtm_join_transaction(xid);, I assume that the foreign server has to be a
PostgreSQL server, one which has this extension installed and is of a
version that can support this extension. So, we cannot use the extension
for all FDWs, and even for postgres_fdw it can be used only with a foreign
server that has the above capabilities. The slides mention just FDW, but I
think they mean postgres_fdw and not all FDWs.

I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.

On the other hand, I see a couple of problems:

1. The extensible transaction manager API is meant to be pluggable.
Depending on which XTM module you choose to load, the SQL that needs
to be executed by postgres_fdw on the remote node will vary.
postgres_fdw shouldn't have knowledge of all the possible XTMs out
there, so it would need some way to know what SQL to send.

2. If the remote server isn't running the same XTM as the local
server, or if it is running the same XTM but is not part of the same
group of cooperating nodes as the local server, then we can't send a
command to join the distributed transaction at all. In that case, the
2PC for FDW approach is still, maybe, useful.

Elaborating more on this: Slide 11 shows the arbiter protocol to start a
transaction and the next slide shows the same for commit. Slide 15 shows the
transaction flow diagram for tsDTM. The DTM approach doesn't specify how
xids are communicated between nodes, but it's implicit in the protocol that
the xid space is shared by the nodes. Similarly, tsDTM assumes that the CSN
space is shared by all the nodes (see synchronization for max(CSN)). This
cannot be assumed for FDWs (not even postgres_fdw), where foreign servers
are independent entities with independent xid spaces.

On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

2PC for FDW and XTM are trying to solve different problems with some
commonality. 2PC for FDW is trying to solve the problem of atomic commit (I am
borrowing the terminology you used at PGCon 2015) for FDWs in general
(although limited to FDWs which can support two-phase commit), and XTM tries
to solve the problems of atomic visibility, atomic commit and consistency for
postgres_fdw where foreign servers support XTM. The only thing common
between these two is atomic commit.

If we accept XTM and discard 2PC for FDW, we will not be able to support
atomic commit for FDWs in general. That, I think, would be a serious
limitation for Postgres FDW, especially now that DML is allowed. If we accept
only 2PC for FDW and discard XTM, we won't be able to get atomic visibility
and consistency for postgres_fdw with foreign servers supporting XTM. That
would again be a serious limitation for solutions implementing sharding,
multi-master clusters etc.

There are approaches like [1] by which a cluster of heterogeneous servers
(with some level of snapshot isolation) can be constructed. Ideally, that
will enable PostgreSQL users to maximize their utilization of FDWs.

Any distributed transaction management requires 2PC in some form or other.
So, we should implement 2PC for FDW keeping in mind the various forms of 2PC
used in practice, and then use that infrastructure to build XTM-like
capabilities for restricted postgres_fdw uses. Previously, I requested the
authors of XTM to look at my patch and give me feedback about their
requirements for implementing the 2PC part of XTM, but I have not heard
anything from them.
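
For reference, the basic form of 2PC that any such infrastructure has to provide can be sketched as below. This is a simplified, in-memory illustration and not the code in the patch: Participant, PartState and two_phase_commit() are hypothetical names, the prepare step is simulated by a flag, and connection loss between the two phases (the unresolved-transaction case the patch's resolver is meant for) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states of the remote part of a transaction. */
typedef enum PartState
{
	PS_ACTIVE,
	PS_PREPARED,
	PS_COMMITTED,
	PS_ABORTED
} PartState;

/* Hypothetical stand-in for one foreign server taking part in the transaction. */
typedef struct Participant
{
	bool		prepare_ok;		/* simulated outcome of PREPARE TRANSACTION */
	PartState	state;
} Participant;

/* Phase 1 on a single participant: try to prepare. */
static bool
participant_prepare(Participant *p)
{
	if (!p->prepare_ok)
	{
		p->state = PS_ABORTED;
		return false;
	}
	p->state = PS_PREPARED;
	return true;
}

/*
 * Run both phases.  If every participant prepares, all of them are
 * committed; if any prepare fails, every participant is rolled back,
 * including those already prepared and those not yet asked.
 * Returns true when the transaction committed everywhere.
 */
bool
two_phase_commit(Participant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	/* Phase 1: prepare, stopping at the first failure. */
	for (i = 0; i < n; i++)
	{
		if (!participant_prepare(&parts[i]))
		{
			all_prepared = false;
			break;
		}
	}

	/* Phase 2: apply the same decision to every participant. */
	for (i = 0; i < n; i++)
		parts[i].state = all_prepared ? PS_COMMITTED : PS_ABORTED;

	return all_prepared;
}
```

The decision is made once, after phase 1, and then applied uniformly, which is what gives atomic commit without saying anything about visibility.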

[1] https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/1c0a12a383dd2cd8c125613300585c64/7684dd8109a5b3d5c1256de40051686f/$FILE/tdd99.pdf

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#47

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 10 years ago

In reply to: Robert Haas (#43)

Re: Transactions involving multiple postgres foreign servers

On 06.11.2015 21:37, Robert Haas wrote:

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's updated
patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

The recent eXtensible Transaction Manager and the slides shared at the
Vienna sharding summit, now posted at
https://drive.google.com/file/d/0B8hhdhUVwRHyMXpRRHRSLWFXeXc/view make
me think that some careful thought is needed here about what we want
and how it should work. Slide 10 proposes a method for the extensible
transaction manager API to interact with FDWs. The FDW would do this:

select dtm_join_transaction(xid);
begin transaction;
update...;
commit;

I think the idea here is that the commit command doesn't really
commit; it just escapes the distributed transaction while leaving it
marked not-committed. When the transaction subsequently commits on
the local server, the XID is marked committed and the effects of the
transaction become visible on all nodes.

I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.

On the other hand, I see a couple of problems:

1. The extensible transaction manager API is meant to be pluggable.
Depending on which XTM module you choose to load, the SQL that needs
to be executed by postgres_fdw on the remote node will vary.
postgres_fdw shouldn't have knowledge of all the possible XTMs out
there, so it would need some way to know what SQL to send.

2. If the remote server isn't running the same XTM as the local
server, or if it is running the same XTM but is not part of the same
group of cooperating nodes as the local server, then we can't send a
command to join the distributed transaction at all. In that case, the
2PC for FDW approach is still, maybe, useful.

On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

Sorry, but so far we have considered only the case of a homogeneous
environment, where all cluster instances are using PostgreSQL with the same
XTM implementation.
I can imagine situations where it may be useful to coordinate transaction
processing in a heterogeneous cluster, but it seems to be quite an exotic
use case.
Combining several different databases in one cluster can be explained by
historical reasons or by the specifics of a particular system architecture.
But I cannot imagine any reason for using different XTM implementations,
and especially for mixing them in one transaction.


#48

Konstantin Knizhnik

k.knizhnik@postgrespro.ru

about 10 years ago

In reply to: Ashutosh Bapat (#46)

Re: Transactions involving multiple postgres foreign servers

On 09.11.2015 09:59, Ashutosh Bapat wrote:

Since the foreign server (referred to in the slides as the secondary
server) is required to run "create extension pg_dtm" and select
dtm_join_transaction(xid);, I assume that the foreign server has to be
a PostgreSQL server, one which has this extension installed and is of
a version that can support this extension. So, we cannot use the
extension for all FDWs, and even for postgres_fdw it can be used only
with a foreign server that has the above capabilities. The slides
mention just FDW, but I think they mean postgres_fdw and not all FDWs.

The DTM approach is based on sharing XIDs and snapshots between different
cluster nodes, so it can be easily implemented only for PostgreSQL.
So I really have postgres_fdw in mind rather than an abstract FDW.
The approach with timestamps is more universal and in principle can be used
for any DBMS where visibility is based on CSNs.
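
To illustrate what "visibility based on CSNs" could mean, here is a minimal sketch in C. It is not code from pg_dtm or tsDTM: the type `CSN`, the marker `CSN_IN_PROGRESS` and both function names are hypothetical, and the rule used (a transaction is visible to a snapshot if it committed at or before the snapshot's CSN) is an assumption made for the example. The second helper mirrors the max(CSN) synchronization mentioned in the slides discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical commit sequence number type; 0 marks "not committed yet". */
typedef uint64_t CSN;
#define CSN_IN_PROGRESS ((CSN) 0)

/*
 * Assumed visibility rule: the effects of a transaction are visible to a
 * snapshot if the transaction committed at or before the snapshot's CSN.
 */
static bool
csn_visible(CSN commit_csn, CSN snapshot_csn)
{
	assert(snapshot_csn != CSN_IN_PROGRESS);

	if (commit_csn == CSN_IN_PROGRESS)
		return false;			/* still running or aborted */
	return commit_csn <= snapshot_csn;
}

/*
 * Assumed synchronization step: the nodes taking part in a transaction
 * agree on the largest CSN any of them has seen.
 */
static CSN
csn_sync_max(const CSN *local_csns, size_t n)
{
	CSN			max = CSN_IN_PROGRESS;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (local_csns[i] > max)
			max = local_csns[i];
	}
	return max;
}
```

Because the check needs only two integers, it does not depend on PostgreSQL's XID machinery, which is the sense in which a timestamp-based scheme could apply to other DBMSs.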

I think that this API is intended to provide not only consistent
cross-node decisions about whether a particular transaction has
committed, but also consistent visibility. If the API is sufficient
for that and if it can be made sufficiently performant, that's a
strictly stronger guarantee than what this proposal would provide.

On the other hand, I see a couple of problems:

1. The extensible transaction manager API is meant to be pluggable.
Depending on which XTM module you choose to load, the SQL that needs
to be executed by postgres_fdw on the remote node will vary.
postgres_fdw shouldn't have knowledge of all the possible XTMs out
there, so it would need some way to know what SQL to send.

2. If the remote server isn't running the same XTM as the local
server, or if it is running the same XTM but is not part of the same
group of cooperating nodes as the local server, then we can't send a
command to join the distributed transaction at all. In that case, the
2PC for FDW approach is still, maybe, useful.

Elaborating more on this: Slide 11 shows arbiter protocol to start a
transaction and next slide shows the same for commit. Slide 15 shows
the transaction flow diagram for tsDTM. In DTM approach it doesn't
specify how xids are communicated between nodes, but it's implicit in
the protocol that xid space is shared by the nodes. Similarly for
tsDTM it assumes that CSN space is shared by all the nodes (see
synchronization for max(CSN)). This can not be assumed for FDWs (not
even postgres_fdw) where foreign servers are independent entities with
independent xid space.

The proposed DTM architecture includes a "coordinator". The coordinator is a
process responsible for managing the logic of a distributed transaction. It
can be just a normal client application, or it can be an intermediate
master node (as in the case of pg_shard).
It can also be a PostgreSQL instance (as in the case of postgres_fdw), or not.
We try to put as few restrictions on the "coordinator" as possible.
It should just communicate with PostgreSQL backends using any
communication protocol it likes (e.g. libpq) and invoke some special
stored procedures which are part of the particular DTM extension. These
functions also impose a protocol for exchanging data between the different
nodes involved in the distributed transaction. In this way we
propagate XIDs/CSNs between different nodes, which may not even know
about each other.
In the DTM approach, nodes only know the location of the "arbiter". In the
tsDTM approach there is not even an arbiter...

On the whole, I'm inclined to think that the XTM-based approach is
probably more useful and more general, if we can work out the problems
with it. I'm not sure that I'm right, though, nor am I sure how hard
it will be.

2PC for FDW and XTM are trying to solve different problems with some
commonality. 2PC for FDW is trying to solve problem of atomic commit
(I am borrowing from the terminology you used in PGCon 2015) for FDWs
in general (although limited to FDWs which can support 2 phase commit)
and XTM tries to solve problems of atomic visibility, atomic commit
and consistency for postgres_fdw where foreign servers support XTM.
The only thing common between these two is atomic commit.

If we accept XTM and discard 2PC for FDW, we will not be able to
support atomic commit for FDWs in general. That, I think would be
serious limitation for Postgres FDW, esp. now that DMLs are allowed.
If we accept only 2PC for FDW and discard XTM, we won't be able to get
atomic visibility and consistency for postgres_fdw with foreign
servers supporting XTM. That would be again serious limitation for
solutions implementing sharding, multi-master clusters etc.
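
The atomic-commit guarantee discussed above can be sketched as a toy two-phase commit loop in C. This is a simulation only, not the patch's implementation: `Participant`, `TxnState`, `can_prepare` and both function names are hypothetical, and a real foreign server would be driven through PREPARE TRANSACTION and COMMIT PREPARED / ROLLBACK PREPARED rather than in-memory flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server transaction state. */
typedef enum
{
	TXN_ACTIVE,
	TXN_PREPARED,
	TXN_COMMITTED,
	TXN_ABORTED
} TxnState;

typedef struct
{
	TxnState	state;
	bool		can_prepare;	/* simulates whether the prepare step succeeds */
} Participant;

/* Phase 1 for one participant: try to prepare its transaction. */
static bool
participant_prepare(Participant *p)
{
	assert(p->state == TXN_ACTIVE);

	if (!p->can_prepare)
	{
		p->state = TXN_ABORTED;
		return false;
	}
	p->state = TXN_PREPARED;
	return true;
}

/*
 * Run two-phase commit over n participants.  Returns true if every
 * participant committed, false if every participant was rolled back.
 */
static bool
two_phase_commit(Participant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	/* Phase 1: stop at the first participant that fails to prepare. */
	for (i = 0; i < n; i++)
	{
		if (!participant_prepare(&parts[i]))
		{
			all_prepared = false;
			break;
		}
	}

	/* Phase 2: apply the same decision to every participant. */
	for (i = 0; i < n; i++)
		parts[i].state = all_prepared ? TXN_COMMITTED : TXN_ABORTED;

	return all_prepared;
}
```

The sketch deliberately ignores the hard part raised in this thread: a participant that becomes unreachable between the two phases stays prepared until something resolves it later.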

There are approaches like [1] by which a cluster of heterogeneous servers
(with some level of snapshot isolation) can be constructed. Ideally
that will enable PostgreSQL users to maximize their utilization of FDWs.

Any distributed transaction management requires 2PC in some form or
other. So, we should implement 2PC for FDW keeping in mind the various
forms of 2PC used in practice, and use that infrastructure to build
XTM-like capabilities for restricted postgres_fdw uses. Previously, I
requested the authors of XTM to look at my patch and provide me
feedback about their requirements for implementing the 2PC part of XTM.
But I have not heard anything from them.

1.
https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/1c0a12a383dd2cd8c125613300585c64/7684dd8109a5b3d5c1256de40051686f/$FILE/tdd99.pdf

Sorry, maybe I missed some message, but I have not received a request
from you to provide feedback concerning your patch.


--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#49

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 10 years ago

In reply to: Konstantin Knizhnik (#48)

Re: Transactions involving multiple postgres foreign servers

Any distributed transaction management requires 2PC in some form or other.
So, we should implement 2PC for FDW keeping in mind the various forms of 2PC
used in practice, and use that infrastructure to build XTM-like capabilities
for restricted postgres_fdw uses. Previously, I requested the authors
of XTM to look at my patch and provide me feedback about their requirements
for implementing the 2PC part of XTM. But I have not heard anything from them.

1.
https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/1c0a12a383dd2cd8c125613300585c64/7684dd8109a5b3d5c1256de40051686f/$FILE/tdd99.pdf

Sorry, maybe I missed some message, but I have not received a request from
you to provide feedback concerning your patch.

See my mail on 31st August on hackers in the thread with subject
"Horizontal scalability/sharding".

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#50

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 10 years ago

In reply to: Robert Haas (#43)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's updated
patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

Here's the updated patch. I didn't use version numbers in the file names of
my previous patches; I am doing so from this one onwards.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

Attachments:

pg_fdw_transact_v1.patch (text/x-diff; charset=US-ASCII)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..6f587ae
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,364 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How often (in milliseconds) the resolver daemon checks for unresolved transactions */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/* 
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL; 
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+			
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid); 
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/* 
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+	
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done exit now */
+	proc_exit(0);
+}
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 1a1e5b5..341db6f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -8,20 +8,22 @@
  * IDENTIFICATION
  *		  contrib/postgres_fdw/connection.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "utils/hsearch.h"
 #include "utils/memutils.h"
 
 
 /*
  * Connection cache hash table entry
  *
  * The lookup key in this hash table is the foreign server OID plus the user
@@ -57,52 +59,59 @@ typedef struct ConnCacheEntry
 static HTAB *ConnectionHash = NULL;
 
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
 
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+									bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
  * server with the user's authorization.  A new connection is established
  * if we don't already have a suitable one, and a transaction is opened at
  * the right subtransaction nesting depth if we didn't do that already.
  *
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok if true, indicates that caller can handle connection
+ * error by itself. If false, raise error.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
  * syscaches.  For the moment, though, it's not clear that this would really
  * be useful and not mere pedantry.  We could not flush any active connections
  * mid-transaction anyway.
  */
 PGconn *
 GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt)
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
 		HASHCTL		ctl;
 
@@ -116,23 +125,20 @@ GetConnection(ForeignServer *server, UserMapping *user,
 									 HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 		/*
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
 		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key.serverid = server->serverid;
 	key.userid = user->userid;
 
 	/*
 	 * Find or create cached entry for requested connection.
 	 */
 	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
 	if (!found)
 	{
@@ -152,41 +158,64 @@ GetConnection(ForeignServer *server, UserMapping *user,
 	/*
 	 * If cache entry doesn't have a connection, we have to establish a new
 	 * connection.  (If connect_pg_server throws an error, the cache entry
 	 * will be left in a valid empty state.)
 	 */
 	if (entry->conn == NULL)
 	{
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 	server->servername);
+			return NULL;
+		}
+
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\"",
 			 entry->conn, server->servername);
 	}
 
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, server);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
+
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
 
 	return entry->conn;
 }
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+					bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
 	/*
 	 * Use PG_TRY block to ensure closing connection on error.
 	 */
 	PG_TRY();
 	{
 		const char **keywords;
 		const char **values;
@@ -227,25 +256,29 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
 		{
 			char	   *connmessage;
 			int			msglen;
 
 			/* libpq typically appends a newline, strip that */
 			connmessage = pstrdup(PQerrorMessage(conn));
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"",
+					   		server->servername),
+						errdetail_internal("%s", connmessage)));
 		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
 		 * otherwise, he's piggybacking on the postgres server's user
 		 * identity. See also dblink_security_check() in contrib/dblink.
 		 */
 		if (!superuser() && !PQconnectionUsedPassword(conn))
 			ereport(ERROR,
 				  (errcode(ERRCODE_S_R_E_PROHIBITED_SQL_STATEMENT_ATTEMPTED),
@@ -362,29 +395,36 @@ do_sql_command(PGconn *conn, const char *sql)
  * Start remote transaction or subtransaction, if needed.
  *
  * Note that we always use at least REPEATABLE READ in the remote session.
  * This is so that, if a query initiates multiple scans of the same or
  * different foreign tables, we will get snapshot-consistent results from
  * those scans.  A disadvantage is that we can't provide sane emulation of
  * READ COMMITTED behavior --- it would be nice if we had some other way to
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, ForeignServer *server)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether the two phase
+		 * compliance is possible. 
+		 */
+		RegisterXactForeignServer(entry->key.serverid, entry->key.userid,
+									server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
 		if (IsolationIsSerializable())
 			sql = "START TRANSACTION ISOLATION LEVEL SERIALIZABLE";
 		else
 			sql = "START TRANSACTION ISOLATION LEVEL REPEATABLE READ";
 		do_sql_command(entry->conn, sql);
 		entry->xact_depth = 1;
 	}
@@ -506,148 +546,295 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 		if (clear)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 	if (clear)
 		PQclear(res);
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The returned length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Connection hash should have a connection we want */
+		
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+
+		PGconn	*conn = entry->conn;
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Connection hash should have a connection we want */
+		
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key.serverid = serverid;
+	key.userid = userid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
+		
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key.serverid = serverid;
+		key.userid = userid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+	if (!conn && IsTransactionState())
+	{
+		ForeignServer	*foreign_server = GetForeignServer(serverid); 
+		UserMapping		*user_mapping = GetUserMapping(userid, serverid);
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+		conn = GetConnection(foreign_server, user_mapping, false, false, true);
+	}
 
-			switch (event)
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed; raise a warning to log the reason for the
+			 * failure. We may not be in a transaction here, so raising an error
+			 * doesn't help. Even if we are in a transaction, it would be the
+			 * resolver transaction, which would get aborted on raising an error,
+			 * thus delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+	
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+	
+			/*
+			 * If we tried to COMMIT/ROLLBACK a prepared transaction and the
+			 * prepared transaction was missing on the foreign server, it was
+			 * probably resolved by some other means, so consider it resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.  We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to the
+	 * end-of-transaction callback.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
 	/* Also reset cursor numbering for next transaction */
 	cursor_number = 0;
 }
 
 /*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
@@ -708,10 +895,33 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 			if (PQresultStatus(res) != PGRES_COMMAND_OK)
 				pgfdw_report_error(WARNING, res, entry->conn, true, sql);
 			else
 				PQclear(res);
 		}
 
 		/* OK, we're outta that level of subtransaction */
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+	
+	/* Check the options for two phase compliance */ 
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 866a09b..0c52753 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3779,10 +3779,348 @@ ERROR:  type "public.Colors" does not exist
 LINE 4:   "Col" public."Colors" OPTIONS (column_name 'Col')
                 ^
 QUERY:  CREATE FOREIGN TABLE t5 (
   c1 integer OPTIONS (column_name 'c1'),
   c2 text OPTIONS (column_name 'c2') COLLATE pg_catalog."C",
   "Col" public."Colors" OPTIONS (column_name 'Col')
 ) SERVER loopback
 OPTIONS (schema_name 'import_source', table_name 't5');
 CONTEXT:  importing foreign table "t5"
 ROLLBACK;
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors to occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_with_fdw           | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_with_fdw           | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server in
+-- turn rolling back the whole transaction.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+ERROR:  duplicate key value violates unique constraint "lt_val_key"
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     2
+(1 row)
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+ pg_fdw_remove 
+---------------
+ 
+(1 row)
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+ count 
+-------
+     0
+(1 row)
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+(2 rows)
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+ database | foreign server | local user | status 
+----------+----------------+------------+--------
+(0 rows)
+
+SELECT database, gid FROM pg_prepared_xacts;
+ database | gid 
+----------+-----
+(0 rows)
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+      database      | local transaction identifier | foreign server | local user |  status  
+--------------------+------------------------------+----------------+------------+----------
+ contrib_regression | prep_xact_fdw_subxact        | loopback1      | ashutosh   | prepared
+ contrib_regression | prep_xact_fdw_subxact        | loopback2      | ashutosh   | prepared
+(2 rows)
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+ val 
+-----
+   3
+   4
+  10
+(3 rows)
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- Those servers capable of two phase commit, will commit their transactions
+-- atomically with the local transaction. The transactions on the incapable
+-- servers will be committed independent of the outcome of the other foreign
+-- transactions.
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+ALTER SERVER loopback2 OPTIONS (SET two_phase_commit 'false'); 
+-- Changes to the local server and the loopback1 will be rolled back as prepare
+-- on loopback1 would fail because of constraint violation. But the changes on
+-- loopback2, which doesn't execute two phase commit, will be committed.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (2);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (1);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+ERROR:  can not prepare transaction on foreign server loopback1
+SELECT * FROM lt;
+ val 
+-----
+   1
+   2
+(2 rows)
+
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+-- Changes to all the servers, local and foreign, will be rolled back as those
+-- on loopback2 (incapable of two-phase commit) could not be commited.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (1);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+WARNING:  duplicate key value violates unique constraint "lt_val_key"
+WARNING:  could not commit transaction on server loopback2
+SELECT * FROM lt;
+ val 
+-----
+   1
+   3
+   2
+(3 rows)
+
+-- At the end, we should not have any foreign transaction remaining unresolved
+SELECT * FROM pg_fdw_xacts;
+ transaction | database | foreign server | local user | status | foreign transaction identifier 
+-------------+----------+----------------+------------+--------+--------------------------------
+(0 rows)
+
+DROP SERVER loopback1 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP SERVER loopback2 CASCADE;
+NOTICE:  drop cascades to 2 other objects
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 380ac80..32a6247 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -100,21 +100,22 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname),
 					 errhint("Valid options in this context are: %s",
 							 buf.data)));
 		}
 
 		/*
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
 		}
 		else if (strcmp(def->defname, "fdw_startup_cost") == 0 ||
 				 strcmp(def->defname, "fdw_tuple_cost") == 0)
 		{
 			/* these must have a non-negative numeric value */
 			double		val;
 			char	   *endp;
@@ -155,20 +156,22 @@ InitPgFdwOptions(void)
 		{"use_remote_estimate", ForeignServerRelationId, false},
 		{"use_remote_estimate", ForeignTableRelationId, false},
 		/* cost factors */
 		{"fdw_startup_cost", ForeignServerRelationId, false},
 		{"fdw_tuple_cost", ForeignServerRelationId, false},
 		/* shippable extensions */
 		{"extensions", ForeignServerRelationId, false},
 		/* updatable is available on both server and table */
 		{"updatable", ForeignServerRelationId, false},
 		{"updatable", ForeignTableRelationId, false},
+		/* 2PC compatibility */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
 	/* Prevent redundant initialization. */
 	if (postgres_fdw_options)
 		return;
 
 	/*
 	 * Get list of valid libpq options.
 	 *
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index cd4ed0c..3f765e3 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -9,20 +9,22 @@
  *		  contrib/postgres_fdw/postgres_fdw.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include "postgres_fdw.h"
 
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/cost.h"
 #include "optimizer/pathnode.h"
@@ -332,20 +334,26 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for EXPLAIN */
 	routine->ExplainForeignScan = postgresExplainForeignScan;
 	routine->ExplainForeignModify = postgresExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	routine->AnalyzeForeignTable = postgresAnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	routine->ImportForeignSchema = postgresImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	PG_RETURN_POINTER(routine);
 }
 
 /*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
  * We should consider the effect of all baserestrictinfo clauses here, but
  * not any join clauses.
  */
@@ -959,21 +967,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	/* Get info about foreign table. */
 	fsstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(fsstate->rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(server, user, false);
+	fsstate->conn = GetConnection(server, user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
 
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
 	fsstate->retrieved_attrs = (List *) list_nth(fsplan->fdw_private,
 											   FdwScanPrivateRetrievedAttrs);
@@ -1357,21 +1365,21 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	 */
 	rte = rt_fetch(resultRelInfo->ri_RangeTableIndex, estate->es_range_table);
 	userid = rte->checkAsUser ? rte->checkAsUser : GetUserId();
 
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(userid, server->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(server, user, true);
+	fmstate->conn = GetConnection(server, user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
 	fmstate->query = strVal(list_nth(fdw_private,
 									 FdwModifyPrivateUpdateSql));
 	fmstate->target_attrs = (List *) list_nth(fdw_private,
 											  FdwModifyPrivateTargetAttnums);
 	fmstate->has_returning = intVal(list_nth(fdw_private,
 											 FdwModifyPrivateHasReturning));
 	fmstate->retrieved_attrs = (List *) list_nth(fdw_private,
@@ -1811,21 +1819,21 @@ estimate_path_cost_size(PlannerInfo *root,
 			appendWhereClause(&sql, root, baserel, fpinfo->remote_conds,
 							  true, NULL);
 		if (remote_join_conds)
 			appendWhereClause(&sql, root, baserel, remote_join_conds,
 							  (fpinfo->remote_conds == NIL), NULL);
 
 		if (pathkeys)
 			appendOrderByClause(&sql, root, baserel, pathkeys);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->server, fpinfo->user, false);
+		conn = GetConnection(fpinfo->server, fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
 
 		retrieved_rows = rows;
 
 		/* Factor in the selectivity of the locally-checked quals */
 		local_sel = clauselist_selectivity(root,
 										   local_join_conds,
 										   baserel->relid,
@@ -2390,21 +2398,21 @@ postgresAnalyzeForeignTable(Relation relation,
 	 * it's probably not worth redefining that API at this point.
 	 */
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
 	 */
 	initStringInfo(&sql);
 	deparseAnalyzeSizeSql(&sql, relation);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
 	{
@@ -2482,21 +2490,21 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 											ALLOCSET_SMALL_INITSIZE,
 											ALLOCSET_SMALL_MAXSIZE);
 
 	/*
 	 * Get the connection to use.  We do the remote access as the table's
 	 * owner, even if the ANALYZE was started by some other user.
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, server->serverid);
-	conn = GetConnection(server, user, false);
+	conn = GetConnection(server, user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
 	 */
 	cursor_number = GetCursorNumber(conn);
 	initStringInfo(&sql);
 	appendStringInfo(&sql, "DECLARE c%u CURSOR FOR ", cursor_number);
 	deparseAnalyzeSql(&sql, relation, &astate.retrieved_attrs);
 
 	/* In what follows, do not risk leaking any PGresults. */
@@ -2683,21 +2691,21 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 					(errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
 					 errmsg("invalid option \"%s\"", def->defname)));
 	}
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(server, mapping, false);
+	conn = GetConnection(server, mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
 		import_collate = false;
 
 	/* Create workspace for strings */
 	initStringInfo(&buf);
 
 	/* In what follows, do not risk leaking any PGresults. */
 	PG_TRY();
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f243de8..e61157f 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -10,20 +10,21 @@
  *
  *-------------------------------------------------------------------------
  */
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
 #include "utils/relcache.h"
+#include "access/fdw_xact.h"
 
 #include "libpq-fe.h"
 
 /*
  * FDW-specific planner information kept in RelOptInfo.fdw_private for a
  * foreign table.  This information is collected by postgresGetForeignRelSize.
  */
 typedef struct PgFdwRelationInfo
 {
 	/* baserestrictinfo clauses, broken down into safe and unsafe subsets. */
@@ -54,21 +55,22 @@ typedef struct PgFdwRelationInfo
 	ForeignServer *server;
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 } PgFdwRelationInfo;
 
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(ForeignServer *server, UserMapping *user,
-			  bool will_prep_stmt);
+			  bool will_prep_stmt, bool start_transaction,
+			  bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
 						 const char **keywords,
 						 const char **values);
@@ -104,19 +106,26 @@ extern void deparseUpdateSql(StringInfo buf, PlannerInfo *root,
 				 List *targetAttrs, List *returningList,
 				 List **retrieved_attrs);
 extern void deparseDeleteSql(StringInfo buf, PlannerInfo *root,
 				 Index rtindex, Relation rel,
 				 List *returningList,
 				 List **retrieved_attrs);
 extern void deparseAnalyzeSizeSql(StringInfo buf, Relation rel);
 extern void deparseAnalyzeSql(StringInfo buf, Relation rel,
 				  List **retrieved_attrs);
 extern void deparseStringLiteral(StringInfo buf, const char *val);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+											bool is_commit,
+											int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, bool is_commit);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+									char *prep_info);
 extern Expr *find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel);
 extern void appendOrderByClause(StringInfo buf, PlannerInfo *root,
 					RelOptInfo *baserel, List *pathkeys);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
 extern bool is_shippable(Oid objectId, Oid classId, PgFdwRelationInfo *fpinfo);
 
 #endif   /* POSTGRES_FDW_H */
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 671e38c..b6fe637 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -873,10 +873,239 @@ IMPORT FOREIGN SCHEMA nonesuch FROM SERVER nowhere INTO notthere;
 -- We can fake this by dropping the type locally in our transaction.
 CREATE TYPE "Colors" AS ENUM ('red', 'green', 'blue');
 CREATE TABLE import_source.t5 (c1 int, c2 text collate "C", "Col" "Colors");
 
 CREATE SCHEMA import_dest5;
 BEGIN;
 DROP TYPE "Colors" CASCADE;
 IMPORT FOREIGN SCHEMA import_source LIMIT TO (t5)
   FROM SERVER loopback INTO import_dest5;  -- ERROR
 ROLLBACK;
+
+-- This will suppress the context of errors, which contains prepared transaction
+-- IDs. Those come out to be different each time.
+\set VERBOSITY terse
+-- Test transactional consistency for multiple server case
+-- create two loopback servers for testing consistency on two connections
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback1 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+
+DO $d$
+    BEGIN
+        EXECUTE $$CREATE SERVER loopback2 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$',
+					 two_phase_commit 'true'
+            )$$;
+    END;
+$d$;
+
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+
+-- create a local table to refer to as foreign table. Add a row. The table has
+-- constraints which are deferred till end of transaction. This allows commit
+-- time errors to occur by inserting data which violates constraints.
+CREATE TABLE lt(val int UNIQUE DEFERRABLE INITIALLY DEFERRED);
+-- create two foreign tables each on separate server referring to the local table.
+CREATE FOREIGN TABLE ft1_lt (val int) SERVER loopback1 OPTIONS (table_name 'lt');
+CREATE FOREIGN TABLE ft2_lt (val int) SERVER loopback2 OPTIONS (table_name 'lt');
+
+-- test prepared transactions with foreign servers
+-- test for commit prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (3);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+-- prepared transactions should be seen in the system view
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- test for rollback prepared
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+SELECT * FROM lt;
+
+-- In a transaction insert two rows one each to the two foreign tables. One of
+-- the rows violates the constraint and other not. At the time of commit
+-- constraints on one of the server will rollback transaction on that server in
+-- turn rolling back the whole transaction.
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1); -- Violates constraint
+	INSERT INTO ft2_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (4);
+	INSERT INTO ft2_lt VALUES (3); -- Violates constraint
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Transaction involving local changes and remote changes, one of them or both
+-- violating the constraints
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints
+	INSERT INTO ft1_lt VALUES (5);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (6);
+	INSERT INTO ft1_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (1); -- violates constraints 
+	INSERT INTO ft1_lt VALUES (3); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- Multiple foreign servers with local changes
+BEGIN TRANSACTION;
+	INSERT INTO lt VALUES (7);
+	INSERT INTO ft1_lt VALUES (8);
+	INSERT INTO ft2_lt VALUES (1); -- violates constraints
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- test for removing foreign transactions 
+BEGIN;
+	INSERT INTO ft1_lt VALUES (10);
+	INSERT INTO ft2_lt VALUES (30);
+PREPARE TRANSACTION 'prep_xact_with_fdw';
+
+-- get the transaction identifiers for foreign servers loopback1 and loopback2
+SELECT "foreign transaction identifier" AS lbs1_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback1'
+\gset
+SELECT "foreign transaction identifier" AS lbs2_id FROM pg_fdw_xacts WHERE "foreign server" = 'loopback2'
+\gset
+-- Rollback the transactions with identifiers collected above. The foreign
+-- servers are pointing to self, so the transactions are local.
+ROLLBACK PREPARED :'lbs1_id';
+ROLLBACK PREPARED :'lbs2_id';
+-- Get the xid of parent transaction into a variable. The foreign
+-- transactions corresponding to this xid are removed later.
+SELECT transaction AS rem_xid FROM pg_prepared_xacts
+\gset
+
+-- There should be 2 entries corresponding to the prepared foreign transactions
+-- on two foreign servers.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Remove the prepared foreign transaction entries.
+SELECT pg_fdw_remove(:'rem_xid'::xid);
+
+-- There should be no foreign prepared transactions now.
+SELECT count(*) FROM pg_fdw_xacts WHERE transaction = :rem_xid;
+
+-- Rollback the parent transaction to release any resources
+ROLLBACK PREPARED 'prep_xact_with_fdw';
+-- source table should be in-tact
+SELECT * FROM lt;
+
+-- test for failing prepared transaction
+BEGIN;
+	INSERT INTO ft1_lt VALUES (1); -- violates constraint, so prepare should fail
+	INSERT INTO ft2_lt VALUES (2);
+PREPARE TRANSACTION 'prep_fdw_xact_failure'; -- should fail
+-- We shouldn't see anything, the transactions prepared on the foreign servers
+-- should be rolled back.
+SELECT database, "foreign server", "local user", status FROM pg_fdw_xacts;
+SELECT database, gid FROM pg_prepared_xacts;
+
+-- subtransactions with foreign servers
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+BEGIN TRANSACTION;
+	INSERT INTO ft1_lt VALUES (1);
+	INSERT INTO ft2_lt VALUES (2);
+	SAVEPOINT sv1;
+		UPDATE ft1_lt SET val = val + 1;
+		UPDATE ft2_lt SET val = val + 1;
+	ROLLBACK TO SAVEPOINT sv1;
+	SAVEPOINT sv2;
+		UPDATE ft1_lt SET val = val + 2;
+		UPDATE ft2_lt SET val = val + 2;
+	RELEASE SAVEPOINT sv2;
+	INSERT INTO lt VALUES (10);
+PREPARE TRANSACTION 'prep_xact_fdw_subxact';
+-- only top transaction's xid should be recorded, not that of subtransactions'
+SELECT P.database, P.gid AS "local transaction identifier",
+		"foreign server", "local user", status
+		FROM pg_fdw_xacts F
+			LEFT JOIN pg_prepared_xacts P ON F.transaction = P.transaction
+		WHERE P.database = F.database;	-- WHERE condition is actually an assertion
+
+COMMIT PREPARED 'prep_xact_fdw_subxact';
+SELECT * FROM lt;
+
+-- What if one of the servers involved in a transaction isn't capable of 2PC?
+-- Those servers capable of two phase commit, will commit their transactions
+-- atomically with the local transaction. The transactions on the incapable
+-- servers will be committed independent of the outcome of the other foreign
+-- transactions.
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+
+ALTER SERVER loopback2 OPTIONS (SET two_phase_commit 'false'); 
+-- Changes to the local server and the loopback1 will be rolled back as prepare
+-- on loopback1 would fail because of constraint violation. But the changes on
+-- loopback2, which doesn't execute two phase commit, will be committed.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (2);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (1);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+TRUNCATE TABLE lt;
+INSERT INTO lt VALUES (1);
+
+-- Changes to all the servers, local and foreign, will be rolled back as those
+-- on loopback2 (incapable of two-phase commit) could not be commited.
+BEGIN TRANSACTION;
+	INSERT INTO ft2_lt VALUES (1);
+	INSERT INTO lt VALUES (3);
+	INSERT INTO ft1_lt VALUES (2);
+COMMIT TRANSACTION;
+SELECT * FROM lt;
+
+-- At the end, we should not have any foreign transaction remaining unresolved
+SELECT * FROM pg_fdw_xacts;
+
+DROP SERVER loopback1 CASCADE;
+DROP SERVER loopback2 CASCADE;
+DROP TABLE lt;
+\set VERBOSITY default
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5549de7..3418c81 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1411,20 +1411,62 @@ include_dir 'conf.d'
        </para>
 
        <para>
         When running a standby server, you must set this parameter to the
         same or higher value than on the master server. Otherwise, queries
         will not be allowed in the standby server.
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions cannot be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled, a transaction involving foreign servers is
+       guaranteed to commit either all or none of its changes on those servers.
+       The parameter can be set at any time during the session; the value in effect
+       when the transaction commits is the one used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
        <primary><varname>work_mem</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
         Specifies the amount of memory to be used by internal sort operations
         and hash tables before writing to temporary disk files. The value
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 1533a6b..247aa09 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -918,20 +918,85 @@ ImportForeignSchema (ImportForeignSchemaStmt *stmt, Oid serverOid);
      useful to test whether a given foreign-table name will pass the filter.
     </para>
 
     <para>
      If the FDW does not support importing table definitions, the
      <function>ImportForeignSchema</> pointer can be set to <literal>NULL</>.
     </para>
 
    </sect2>
 
+   <sect2 id="fdw-callbacks-transactions">
+    <title>FDW Routines For Transaction Management</title>
+
+    <para>
+<programlisting>
+char *
+GetPrepareInfo (Oid serverOid, Oid userid, int *prep_info_len);
+</programlisting>
+
+     Get the prepared transaction identifier for the given foreign server and
+     user. This function is called when executing <xref linkend="sql-commit">, if
+     <literal>atomic_foreign_transaction</> is enabled. It should return a
+     valid transaction identifier that will be used to prepare the transaction
+     on the foreign server. <parameter>prep_info_len</> should be set
+     to the length of this identifier. The identifier should not be longer
+     than 256 bytes. The identifier should not conflict with existing
+     identifiers on the foreign server, and it should be unique enough that no
+     future transaction reuses it. It's possible that a transaction is
+     considered unresolved on <productname>PostgreSQL</> while it is resolved
+     in reality. This causes the foreign transaction resolver to keep trying to
+     resolve the transaction until it finds out that it has been resolved.
+     If the transaction identifier is the same as that of a future transaction,
+     there is a possibility of the future transaction getting resolved
+     erroneously.
+    </para>
+
+    <para>
+     If a foreign server whose Foreign Data Wrapper has a <literal>NULL</>
+      <function>GetPrepareInfo</> pointer participates in a transaction
+      with <literal>atomic_foreign_transaction</> enabled, the transaction
+      is aborted.
+    </para>
+
+    <para>
+<programlisting>
+bool
+HandleForeignTransaction (Oid serverOid, Oid userid, FDWXactAction action,
+                            int prep_id_len, char *prep_id)
+</programlisting>
+
+     Function to end a transaction on the given foreign server with the given user.
+     This function is called when executing <xref linkend="sql-commit"> or
+     <xref linkend="sql-rollback">. The function should complete the transaction
+     according to the <parameter>action</> specified. The function should
+     return TRUE on successful completion of the transaction and FALSE otherwise.
+     It should not throw an error in case of failure to complete the transaction.
+    </para>
+
+    <para>
+    When <parameter>action</> is FDW_XACT_COMMIT or FDW_XACT_ABORT, the function
+    should commit or roll back the running transaction, respectively. When <parameter>action</>
+    is FDW_XACT_PREPARE, the running transaction should be prepared with the
+    identifier given by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    When <parameter>action</> is FDW_XACT_COMMIT_PREPARED or FDW_XACT_ABORT_PREPARED,
+    the function should respectively commit or roll back the transaction identified
+    by <parameter>prep_id</> and <parameter>prep_id_len</>.
+    </para>
+
+    <para>
+    This function is usually called at the end of the transaction, when
+    access to the database may not be possible. Trying to access catalogs
+    in this function may cause an error to be thrown and can affect other foreign
+    data wrappers.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
     <title>Foreign Data Wrapper Helper Functions</title>
 
     <para>
      Several helper functions are exported from the core server so that
      authors of foreign data wrappers can get easy access to attributes of
      FDW-related objects, such as FDW options.
      To use any of these functions, you need to include the header file
@@ -1318,11 +1383,93 @@ GetForeignServerByName(const char *name, bool missing_ok);
     <para>
      See <filename>src/include/nodes/lockoptions.h</>, the comments
      for <type>RowMarkType</> and <type>PlanRowMark</>
      in <filename>src/include/nodes/plannodes.h</>, and the comments for
      <type>ExecRowMark</> in <filename>src/include/nodes/execnodes.h</> for
      additional information.
     </para>
 
   </sect1>
 
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and
+    write data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts
+    a transaction on a foreign server as part of a transaction on
+    <productname>PostgreSQL</>, it is required to register that foreign server,
+    along with the <productname>PostgreSQL</> user whose user mapping is used to
+    connect to it, using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    Here ft1 and ft2 are foreign tables on different foreign servers, which may
+    be using different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. In this protocol the commit
+    is processed in two phases: a prepare phase and a commit phase. In the prepare
+    phase, <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any of the foreign servers fails to prepare its transaction, the prepare
+    phase fails. In the commit phase, all the prepared transactions are committed
+    if the prepare phase has succeeded, or rolled back if the prepare phase failed
+    to prepare transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During prepare phase the distributed transaction manager calls
+    <function>GetPrepareInfo</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with the same identifier with action
+    FDW_XACT_PREPARE.
+    </para>
+    
+    <para>
+    During commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to rollback the prepared transaction. In case the
+    distributed transaction manager fails to commit or rollback a prepared
+    transaction because of connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity cannot be
+    guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on each of the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..8c1afcf 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -1,16 +1,16 @@
 #
 # Makefile for the rmgr descriptor routines
 #
 # src/backend/access/rmgrdesc/Makefile
 #
 
 subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
-	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o gindesc.o \
+	   gistdesc.o hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
 	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..0f0c899
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module describes the WAL records for foreign transaction manager. 
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serveroid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 83cc9e8..041964a 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -104,28 +104,29 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 			if (entry->val == xlrec.wal_level)
 			{
 				wal_level_str = entry->name;
 				break;
 			}
 		}
 
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
 		bool		fpw;
 
 		memcpy(&fpw, rec, sizeof(bool));
 		appendStringInfoString(buf, fpw ? "true" : "false");
 	}
 	else if (info == XLOG_END_OF_RECOVERY)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 94455b2..51b2efd 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -8,16 +8,17 @@
 #
 #-------------------------------------------------------------------------
 
 subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = clog.o commit_ts.o multixact.o parallel.o rmgr.o slru.o subtrans.o \
 	timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o \
+	fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
 # ensure that version checks in xlog.c get recompiled when catversion.h changes
 xlog.o: xlog.c $(top_srcdir)/src/include/catalog/catversion.h
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..9f315d9
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2024 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager. 
+ *
+ * This module manages the transactions involving foreign servers. 
+ *
+ * Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and to the logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */ 
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreing_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine 		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u with user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(fdw_conn->serverid); 
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreing_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be at most 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */ 
+	Oid				serveroid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */ 
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serveroid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serveroid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid foreign_server, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid,
+								Oid userid, int fdw_xact_info_len,
+								char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serveroid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid,
+								bool giveWarning);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */ 
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * The function is responsible for pre-commit processing on foreign connections.
+ * The foreign transactions are prepared on the foreign servers which can
+ * execute two-phase-commit protocol. Those will be aborted or committed after
+ * the current transaction has been aborted or committed resp. We try to commit
+ * transactions on the rest of the foreign servers now. For these servers
+ * it is possible that some transactions commit even if the local transaction
+ * aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Prepare the transactions on the foreign servers, which can execute
+	 * two-phase-commit protocol.
+	 */
+	prepare_foreign_transactions();
+}
+
+/*
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. The rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/* 
+	 * Loop over the foreign connections 
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+															fdw_conn->userid,
+															&fdw_xact_info_len);
+		
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * disk and to WAL (which also relays it to standbys). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 * 
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+											fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info);
+		/*
+		 * Between the register_fdw_xact call above and the point where this
+		 * backend hears back from the foreign server, the backend may abort the
+		 * local transaction (say, because of a signal). During abort processing,
+		 * it will send an ABORT message to the foreign server. If the foreign
+		 * server has not prepared the transaction, the message will succeed. If
+		 * it has prepared the transaction, it will throw an error, which we
+		 * ignore; the prepared foreign transaction will then be resolved by the
+		 * foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact; 
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serveroid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid; 
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serveroid = fdw_xact->serveroid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serveroid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */ 
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.  Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.  The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */ 
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact 
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serveroid, Oid userid,
+					int fdw_xact_id_len, char *fdw_xact_id,
+					FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serveroid == serveroid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serveroid %u, userid %u found",
+						xid, serveroid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serveroid = serveroid;
+	fdw_xact->userid = userid;
+	fdw_xact->fdw_xact_status = fdw_xact_status; 
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */ 
+			fdw_remove_xlog.serveroid = fdw_xact->serveroid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serveroid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of locked foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LWLock. If we reverse the
+	 * order and the process exits in between, we will be left with an entry
+	 * locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/* 
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of a transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *    according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *    succeeds, remove them from foreign transaction entries, otherwise unlock
+ *    them.
+ */ 
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+												is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/* 
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+		ForeignServer		*foreign_server;
+		ForeignDataWrapper	*fdw;
+		FdwRoutine			*fdw_routine;
+
+		foreign_server = GetForeignServer(fdw_xact->serveroid); 
+		fdw = GetForeignDataWrapper(foreign_server->fdwid);
+		fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+					fdw->fdwname);
+		return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool 
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED) ?
+							true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serveroid, fdw_xact->userid,
+								is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+	
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid     | dbid    | serverid | userid  | search for entry with
+ * invalid | invalid | invalid  | invalid | nothing
+ * invalid | invalid | invalid  | valid   | given userid
+ * invalid | invalid | valid    | invalid | given serverid
+ * invalid | invalid | valid    | valid   | given serverid and userid
+ * invalid | valid   | invalid  | invalid | given dbid
+ * invalid | valid   | invalid  | valid   | given dbid and userid
+ * invalid | valid   | valid    | invalid | given dbid and serverid
+ * invalid | valid   | valid    | valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid  | invalid | given xid
+ * valid   | invalid | invalid  | valid   | given xid and userid
+ * valid   | invalid | valid    | invalid | given xid, serverid
+ * valid   | invalid | valid    | valid   | given xid, serverid, userid
+ * valid   | valid   | invalid  | invalid | given xid and dbid 
+ * valid   | valid   | invalid  | valid   | given xid, dbid and userid
+ * valid   | valid   | valid    | invalid | given xid, dbid, serverid
+ * valid   | valid   | valid    | valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every entry matches, so
+ * the function returns true whenever at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked is ignored. If
+ * all the qualifying entries are locked, nothing will be returned in the list
+ * but returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+		
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serveroid)
+			entry_matches = false;
+		
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+	
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+		
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	  		*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+		
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serveroid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+	
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+	
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+		
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serveroid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints.
+ * The foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might have
+ * fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serveroids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serveroids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serveroids[nxacts] = fdw_xact->serveroid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serveroids[cnt], userids[cnt]);
+			
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serveroids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serveroids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns, through fdw_xacts, an array of all foreign prepared transactions for
+ * the user-level function pg_fdw_xact. The return value is the array length.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+	
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+	
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+	
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+			
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+	
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign transaction;
+				 * nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is not
+				 * running on any backend. So probably it crashed before the actual
+				 * abort or commit. So assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+	
+	
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+	
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+		
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serveroid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving. The function gives a way to forget about such prepared
+ * transaction in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering to a point before the transaction id which created
+ *    the prepared foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									PG_GETARG_TRANSACTIONID(XID_ARGNUM);
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+	
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serveroid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("invalid size of FDW transaction state file \"%s\"",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serveroid != serveroid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* fd was already closed above; do not close it again */
+		pfree(buf);
+		return NULL;
+	}
+	
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *    conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+	
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR		  		*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serveroid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serveroid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serveroid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serveroid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serveroid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										serveroid, userid,
+										fdw_xact_file_data->fdw_xact_id_len,
+										fdw_xact_file_data->fdw_xact_id,
+										FDW_XACT_PREPARING);
+			/* LSN is unknown here; 0 makes every checkpoint consider this entry */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */	
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+	
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serveroid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serveroid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..cdbc583 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -7,20 +7,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/brin_xlog.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 8c47e0f..e7c1199 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -35,20 +35,21 @@
  */
 #include "postgres.h"
 
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <time.h>
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/pg_type.h"
@@ -1471,20 +1472,26 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 		RelationCacheInitFilePostInvalidate();
 
 	/* And now do the callbacks */
 	if (isCommit)
 		ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
 	else
 		ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
 	/*
 	 * And now we can clean up our mess.
 	 */
 	RemoveTwoPhaseFile(xid, true);
 
 	RemoveGXact(gxact);
 	MyLockedGxact = NULL;
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 47312f6..47a2a9b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -14,20 +14,21 @@
  *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
 
 #include <time.h>
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -179,20 +180,24 @@ typedef struct TransactionStateData
 	TransactionId *childXids;	/* subcommitted child XIDs, in XID order */
 	int			nChildXids;		/* # of subcommitted child XIDs */
 	int			maxChildXids;	/* allocated size of childXids[] */
 	Oid			prevUser;		/* previous CurrentUserId setting */
 	int			prevSecContext; /* previous SecurityRestrictionContext */
 	bool		prevXactReadOnly;		/* entry-time xact r/o state */
 	bool		startedInRecovery;		/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction,
+										   only valid for a top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
 
 /*
  * CurrentTransactionState always points to the current transaction state
  * block.  It will point to TopTransactionStateData when not in a
  * transaction at all, or when in a top-level transaction.
  */
 static TransactionStateData TopTransactionStateData = {
@@ -1896,20 +1901,23 @@ StartTransaction(void)
 	/* SecurityRestrictionContext should never be set outside a transaction */
 	Assert(s->prevSecContext == 0);
 
 	/*
 	 * initialize other subsystems for new transaction
 	 */
 	AtStart_GUC();
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
 	 */
 	s->state = TRANS_INPROGRESS;
 
 	ShowTransactionState("StartTransaction");
 }
 
 
@@ -1956,20 +1964,23 @@ CommitTransaction(void)
 
 		/*
 		 * Close open portals (converting holdable ones into static portals).
 		 * If there weren't any, we are done ... otherwise loop back to check
 		 * if they queued deferred triggers.  Lather, rinse, repeat.
 		 */
 		if (!PreCommit_Portals(false))
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
 	/*
 	 * The remaining actions cannot call any user-defined code, so it's safe
 	 * to start shutting down within-transaction services.  But note that most
 	 * of this stuff could still throw an error, which would switch us into
 	 * the transaction-abort path.
 	 */
 
@@ -2113,20 +2124,21 @@ CommitTransaction(void)
 	AtEOXact_GUC(true, 1);
 	AtEOXact_SPI(true);
 	AtEOXact_on_commit_actions(true);
 	AtEOXact_Namespace(true, is_parallel_worker);
 	AtEOXact_SMgr();
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
 	ResourceOwnerDelete(TopTransactionResourceOwner);
 	s->curTransactionOwner = NULL;
 	CurTransactionResourceOwner = NULL;
 	TopTransactionResourceOwner = NULL;
 
 	AtCommit_Memory();
 
@@ -2297,20 +2309,21 @@ PrepareTransaction(void)
 	 * before or after releasing the transaction's locks.
 	 */
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
 	AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
 	 *
 	 * We have to record transaction prepares even if we didn't make any
 	 * updates, because the transaction manager might get confused if we lose
 	 * a global transaction.
 	 */
 	EndPrepare(gxact);
 
@@ -2579,20 +2592,21 @@ AbortTransaction(void)
 
 		AtEOXact_GUC(false, 1);
 		AtEOXact_SPI(false);
 		AtEOXact_on_commit_actions(false);
 		AtEOXact_Namespace(false, is_parallel_worker);
 		AtEOXact_SMgr();
 		AtEOXact_Files();
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
 	RESUME_INTERRUPTS();
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 08d1682..d9cd8cd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -16,20 +16,21 @@
 
 #include <ctype.h>
 #include <time.h>
 #include <fcntl.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
 #include "access/timeline.h"
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -4872,20 +4873,21 @@ BootStrapXLOG(void)
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
 	WriteControlFile();
 
 	/* Bootstrap the commit log, too */
 	BootStrapCLOG();
 	BootStrapCommitTs();
 	BootStrapSUBTRANS();
 	BootStrapMultiXact();
@@ -5865,20 +5867,23 @@ CheckRequiredParameterValues(void)
 									 ControlFile->MaxConnections);
 		RecoveryRequiresIntParameter("max_worker_processes",
 									 max_worker_processes,
 									 ControlFile->max_worker_processes);
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
 /*
  * This must be called ONCE during postmaster or standalone-backend startup
  */
 void
 StartupXLOG(void)
 {
 	XLogCtlInsert *Insert;
@@ -6546,21 +6551,24 @@ StartupXLOG(void)
 		{
 			TransactionId *xids;
 			int			nxids;
 
 			ereport(DEBUG1,
 					(errmsg("initializing for hot standby")));
 
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
 
 			/* Tell procarray about the range of xids it has to deal with */
 			ProcArrayInitRecovery(ShmemVariableCache->nextXid);
 
 			/*
 			 * Startup commit log, commit timestamp and subtrans only.
 			 * MultiXact has already been started up and other SLRUs are not
@@ -7146,20 +7154,21 @@ StartupXLOG(void)
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
 	XLogCtl->LogwrtResult = LogwrtResult;
 
 	XLogCtl->LogwrtRqst.Write = EndOfLog;
 	XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
 	 * record before resource manager writes cleanup WAL records or checkpoint
 	 * record is written.
 	 */
 	Insert->fullPageWrites = lastFullPageWrites;
 	LocalSetXLogInsertAllowed();
 	UpdateFullPageWrites();
 	LocalXLogInsertAllowed = -1;
@@ -7343,20 +7352,26 @@ StartupXLOG(void)
 	/*
 	 * Perform end of recovery actions for any SLRUs that need it.
 	 */
 	TrimCLOG();
 	TrimMultiXact();
 
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
 	if (standbyState != STANDBY_DISABLED)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
 	if (readFile >= 0)
 	{
 		close(readFile);
@@ -8606,20 +8621,25 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointMultiXact();
 	CheckPointPredicate();
 	CheckPointRelationMap();
 	CheckPointReplicationSlots();
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
  * Save a checkpoint for recovery restart if appropriate
  *
  * This function is called each time a checkpoint record is read from XLOG.
  * It must determine whether the checkpoint represents a safe restartpoint or
  * not.  If so, the checkpoint record is stashed in shared memory so that
  * CreateRestartPoint can consult it.  (Note that the latter function is
  * executed by the checkpointer, while this one will be executed by the
@@ -9016,56 +9036,59 @@ XLogRestorePoint(const char *rpName)
  */
 static void
 XLogReportParameters(void)
 {
 	if (wal_level != ControlFile->wal_level ||
 		wal_log_hints != ControlFile->wal_log_hints ||
 		MaxConnections != ControlFile->MaxConnections ||
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
 		 * if archiving is not enabled, as you can't start archive recovery
 		 * with wal_level=minimal anyway. We don't really care about the
 		 * values in pg_control either if wal_level=minimal, but seems better
 		 * to keep them up-to-date to avoid confusion.
 		 */
 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
 		{
 			xl_parameter_change xlrec;
 			XLogRecPtr	recptr;
 
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
 			recptr = XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE);
 			XLogFlush(recptr);
 		}
 
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
 
 /*
  * Update full_page_writes in shared memory, and write an
  * XLOG_FPW_CHANGE record if necessary.
  *
  * Note: this function assumes there is no other process running
  * concurrently that could update it.
@@ -9240,20 +9263,21 @@ xlog_redo(XLogReaderState *record)
 		 */
 		if (standbyState >= STANDBY_INITIALIZED)
 		{
 			TransactionId *xids;
 			int			nxids;
 			TransactionId oldestActiveXID;
 			TransactionId latestCompletedXid;
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
 			 * down server, with only prepared transactions still alive. We're
 			 * never overflowed at this point because all subxids are listed
 			 * with their parent prepared transactions.
 			 */
 			running.xcnt = nxids;
 			running.subxcnt = 0;
 			running.subxid_overflow = false;
@@ -9432,20 +9456,21 @@ xlog_redo(XLogReaderState *record)
 		/* Update our copy of the parameters in pg_control */
 		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_parameter_change));
 
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
 		 * recover back up to this point before allowing hot standby again.
 		 * This is particularly important if wal_level was set to 'archive'
 		 * before, and is now 'hot_standby', to ensure you don't run queries
 		 * against the WAL preceding the wal_level change. Same applies to
 		 * decreasing max_* settings.
 		 */
 		minRecoveryPoint = ControlFile->minRecoveryPoint;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 95d6c14..3100f50 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -11,20 +11,21 @@
  *	  src/backend/bootstrap/bootstrap.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
 #include <unistd.h>
 #include <signal.h>
 
 #include "access/htup_details.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
 #include "pg_getopt.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ccc030f..4691e66 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -236,20 +236,29 @@ CREATE VIEW pg_available_extension_versions AS
            LEFT JOIN pg_extension AS X
              ON E.name = X.extname AND E.version = X.extversion;
 
 CREATE VIEW pg_prepared_xacts AS
     SELECT P.transaction, P.gid, P.prepared,
            U.rolname AS owner, D.datname AS database
     FROM pg_prepared_xact() AS P
          LEFT JOIN pg_authid U ON P.ownerid = U.oid
          LEFT JOIN pg_database D ON P.dbid = D.oid;
 
+CREATE VIEW pg_fdw_xacts AS
+	SELECT P.transaction, D.datname AS database, S.srvname AS "foreign server",
+			U.rolname AS "local user", P.status,
+			P.identifier AS "foreign transaction identifier"
+	FROM pg_fdw_xact() AS P
+		LEFT JOIN pg_authid U ON P.userid = U.oid
+		LEFT JOIN pg_database D ON P.dbid = D.oid
+		LEFT JOIN pg_foreign_server S ON P.serverid = S.oid;
+
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
 	CASE WHEN rel.relkind = 'r' THEN 'table'::text
 		 WHEN rel.relkind = 'v' THEN 'view'::text
 		 WHEN rel.relkind = 'm' THEN 'materialized view'::text
 		 WHEN rel.relkind = 'S' THEN 'sequence'::text
@@ -933,10 +942,18 @@ LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'make_interval';
 
 CREATE OR REPLACE FUNCTION
   jsonb_set(jsonb_in jsonb, path text[] , replacement jsonb,
             create_if_missing boolean DEFAULT true)
 RETURNS jsonb
 LANGUAGE INTERNAL
 STRICT IMMUTABLE
 AS 'jsonb_set';
+
+CREATE OR REPLACE FUNCTION
+  pg_fdw_remove(transaction xid DEFAULT NULL, dbid oid DEFAULT NULL,
+				serverid oid DEFAULT NULL, userid oid DEFAULT NULL)
+RETURNS void
+LANGUAGE INTERNAL
+VOLATILE
+AS 'pg_fdw_remove';
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index cc912b2..3408252 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -6,20 +6,21 @@
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  *
  *
  * IDENTIFICATION
  *	  src/backend/commands/foreigncmds.c
  *
  *-------------------------------------------------------------------------
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
 #include "access/xact.h"
 #include "catalog/dependency.h"
 #include "catalog/indexing.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_foreign_data_wrapper.h"
 #include "catalog/pg_foreign_server.h"
 #include "catalog/pg_foreign_table.h"
@@ -1080,20 +1081,34 @@ RemoveForeignServerById(Oid srvId)
 	HeapTuple	tp;
 	Relation	rel;
 
 	rel = heap_open(ForeignServerRelationId, RowExclusiveLock);
 
 	tp = SearchSysCache1(FOREIGNSERVEROID, ObjectIdGetDatum(srvId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one and the server gets dropped, we will not have any chance
+	 * to resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
 
 	heap_close(rel, RowExclusiveLock);
 }
 
 
 /*
  * Common routine to check permission for user-mapping-related DDL
@@ -1252,20 +1267,21 @@ AlterUserMapping(AlterUserMappingStmt *stmt)
 
 	umId = GetSysCacheOid2(USERMAPPINGUSERSERVER,
 						   ObjectIdGetDatum(useId),
 						   ObjectIdGetDatum(srv->serverid));
 	if (!OidIsValid(umId))
 		ereport(ERROR,
 				(errcode(ERRCODE_UNDEFINED_OBJECT),
 				 errmsg("user mapping \"%s\" does not exist for the server",
 						MappingUserName(useId))));
 
+
 	user_mapping_ddl_aclcheck(useId, srv->serverid, stmt->servername);
 
 	tp = SearchSysCacheCopy1(USERMAPPINGOID, ObjectIdGetDatum(umId));
 
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for user mapping %u", umId);
 
 	memset(repl_val, 0, sizeof(repl_val));
 	memset(repl_null, false, sizeof(repl_null));
 	memset(repl_repl, false, sizeof(repl_repl));
@@ -1378,20 +1394,31 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 		/* IF EXISTS specified, just note it */
 		ereport(NOTICE,
 		(errmsg("user mapping \"%s\" does not exist for the server, skipping",
 				MappingUserName(useId))));
 		return InvalidOid;
 	}
 
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
 	object.objectId = umId;
 	object.objectSubId = 0;
 
 	performDeletion(&object, DROP_CASCADE, 0);
 
 	return umId;
 }
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 90c2f4a..f82a537 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -86,20 +86,21 @@
 #ifdef USE_BONJOUR
 #include <dns_sd.h>
 #endif
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
 #include <pthread.h>
 #endif
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdw_xact.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "lib/ilist.h"
 #include "libpq/auth.h"
 #include "libpq/ip.h"
 #include "libpq/libpq.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "pg_getopt.h"
 #include "pgstat.h"
@@ -2541,21 +2542,20 @@ pmdie(SIGNAL_ARGS)
 							   BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER);
 				/* and the autovac launcher too */
 				if (AutoVacPID != 0)
 					signal_child(AutoVacPID, SIGTERM);
 				/* and the bgwriter too */
 				if (BgWriterPID != 0)
 					signal_child(BgWriterPID, SIGTERM);
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
-
 				/*
 				 * If we're in recovery, we can't kill the startup process
 				 * right away, because at present doing so does not release
 				 * its locks.  We might want to change this in a future
 				 * release.  For the time being, the PM_WAIT_READONLY state
 				 * indicates that we're waiting for the regular (read only)
 				 * backends to die off; once they do, we'll kill the startup
 				 * and walreceiver processes.
 				 */
 				pmState = (pmState == PM_RUN) ?
@@ -5705,20 +5705,21 @@ PostmasterMarkPIDForWorkerNotify(int pid)
 
 	dlist_foreach(iter, &BackendList)
 	{
 		bp = dlist_container(Backend, elem, iter.cur);
 		if (bp->pid == pid)
 		{
 			bp->bgworker_notify = true;
 			return true;
 		}
 	}
+
 	return false;
 }
 
 #ifdef EXEC_BACKEND
 
 /*
  * The following need to be available to the save/restore_backend_variables
  * functions.  They are marked NON_EXEC_STATIC in their home modules.
  */
 extern slock_t *ShmemLock;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index c629da3..6fdd818 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -127,20 +127,21 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_MULTIXACT_ID:
 		case RM_RELMAP_ID:
 		case RM_BTREE_ID:
 		case RM_HASH_ID:
 		case RM_GIN_ID:
 		case RM_GIST_ID:
 		case RM_SEQ_ID:
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 			break;
 		case RM_NEXT_ID:
 			elog(ERROR, "unexpected RM_NEXT_ID rmgr_id: %u", (RmgrIds) XLogRecGetRmid(buf.record));
 	}
 }
 
 /*
  * Handle rmgr XLOG_ID records for DecodeRecordIntoReorderBuffer().
  */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 32ac58f..a790e5b 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -14,20 +14,21 @@
  */
 #include "postgres.h"
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/fdw_xact.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/bgworker_internals.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/postmaster.h"
 #include "replication/slot.h"
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
@@ -132,20 +133,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcSignalShmemSize());
 		size = add_size(size, CheckpointerShmemSize());
 		size = add_size(size, AutoVacuumShmemSize());
 		size = add_size(size, ReplicationSlotsShmemSize());
 		size = add_size(size, ReplicationOriginShmemSize());
 		size = add_size(size, WalSndShmemSize());
 		size = add_size(size, WalRcvShmemSize());
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
 
 		/* freeze the addin request size and include it */
 		addin_request_allowed = false;
 		size = add_size(size, total_addin_request);
 
 		/* might as well round it off to a multiple of a typical page size */
 		size = add_size(size, 8192 - (size % 8192));
@@ -243,20 +245,21 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	ReplicationOriginShmemInit();
 	WalSndShmemInit();
 	WalRcvShmemInit();
 
 	/*
 	 * Set up other modules that need some shared memory space
 	 */
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
 	/*
 	 * Alloc the win32 shared backend array
 	 */
 	if (!IsUnderPostmaster)
 		ShmemBackendArrayAllocation();
 #endif
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index c557cb6..d0f1472 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -1,14 +1,12 @@
 # Some commonly-used locks have predefined positions within MainLWLockArray;
-# these are defined here.  If you add a lock, add it to the end to avoid
-# renumbering the existing locks; if you remove a lock, consider leaving a gap
-# in the numbering sequence for the benefit of DTrace and other external
+# these are defined here.  If you add a lock, add it to the end to avoid # renumbering the existing locks; if you remove a lock, consider leaving a gap # in the numbering sequence for the benefit of DTrace and other external
 # debugging scripts.
 
 # 0 is available; was formerly BufFreelistLock
 ShmemIndexLock						1
 OidGenLock							2
 XidGenLock							3
 ProcArrayLock						4
 SInvalReadLock						5
 SInvalWriteLock						6
 WALBufMappingLock					7
@@ -39,10 +37,11 @@ OldSerXidLock						31
 SyncRepLock							32
 BackgroundWorkerLock				33
 DynamicSharedMemoryControlLock		34
 AutoFileLock						35
 ReplicationSlotAllocationLock		36
 ReplicationSlotControlLock			37
 CommitTsControlLock					38
 CommitTsLock						39
 ReplicationOriginLock				40
 MultiXactTruncationLock				41
+FDWXactLock							42
diff --git a/src/backend/utils/adt/xid.c b/src/backend/utils/adt/xid.c
index 6b61765..d6cba87 100644
--- a/src/backend/utils/adt/xid.c
+++ b/src/backend/utils/adt/xid.c
@@ -15,21 +15,20 @@
 #include "postgres.h"
 
 #include <limits.h>
 
 #include "access/multixact.h"
 #include "access/transam.h"
 #include "access/xact.h"
 #include "libpq/pqformat.h"
 #include "utils/builtins.h"
 
-#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 #define PG_RETURN_TRANSACTIONID(x)	return TransactionIdGetDatum(x)
 
 #define PG_GETARG_COMMANDID(n)		DatumGetCommandId(PG_GETARG_DATUM(n))
 #define PG_RETURN_COMMANDID(x)		return CommandIdGetDatum(x)
 
 
 Datum
 xidin(PG_FUNCTION_ARGS)
 {
 	char	   *str = PG_GETARG_CSTRING(0);
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fda0fb9..1fe94bb 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -20,20 +20,21 @@
 #include <float.h>
 #include <math.h>
 #include <limits.h>
 #include <unistd.h>
 #include <sys/stat.h>
 #ifdef HAVE_SYSLOG
 #include <syslog.h>
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "commands/async.h"
 #include "commands/prepare.h"
 #include "commands/vacuum.h"
 #include "commands/variable.h"
 #include "commands/trigger.h"
@@ -1999,20 +2000,33 @@ static struct config_int ConfigureNamesInt[] =
 	{
 		{"max_prepared_transactions", PGC_POSTMASTER, RESOURCES_MEM,
 			gettext_noop("Sets the maximum number of simultaneously prepared transactions."),
 			NULL
 		},
 		&max_prepared_xacts,
 		0, 0, MAX_BACKENDS,
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
 			gettext_noop("Sets the minimum OID of tables for tracking locks."),
 			gettext_noop("Is used to avoid output on system tables."),
 			GUC_NOT_IN_SAMPLE
 		},
 		&Trace_lock_oidmin,
 		FirstNormalObjectId, 0, INT_MAX,
 		NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index dcf929f..f7df0e2 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -116,20 +116,26 @@
 					# (change requires restart)
 #huge_pages = try			# on, off, or try
 					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
 # Note:  Increasing max_prepared_transactions costs ~600 bytes of shared memory
 # per transaction slot, plus lock space (see max_locks_per_transaction).
 # It is not advisable to set max_prepared_transactions nonzero unless you
 # actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0		# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
 					# supported by the operating system:
 					#   posix
 					#   sysv
 					#   windows
 					#   mmap
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index feeff9e..47ecf1e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -192,31 +192,32 @@ static const char *subdirs[] = {
 	"pg_xlog",
 	"pg_xlog/archive_status",
 	"pg_clog",
 	"pg_commit_ts",
 	"pg_dynshmem",
 	"pg_notify",
 	"pg_serial",
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
 	"pg_stat_tmp",
 	"pg_logical",
 	"pg_logical/snapshots",
-	"pg_logical/mappings"
+	"pg_logical/mappings",
 };
 
 
 /* path to 'initdb' binary directory */
 static char bin_path[MAXPGPATH];
 static char backend_exec[MAXPGPATH];
 
 static char **replace_token(char **lines,
 			  const char *token, const char *replacement);
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 32e1d81..8e4cf86 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -327,12 +327,14 @@ main(int argc, char *argv[])
 	printf(_("Size of a large-object chunk:         %u\n"),
 		   ControlFile.loblksize);
 	printf(_("Date/time type storage:               %s\n"),
 		   (ControlFile.enableIntTimes ? _("64-bit integers") : _("floating-point numbers")));
 	printf(_("Float4 argument passing:              %s\n"),
 		   (ControlFile.float4ByVal ? _("by value") : _("by reference")));
 	printf(_("Float8 argument passing:              %s\n"),
 		   (ControlFile.float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile.data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile.max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index d7ac2ba..1e3b4a2 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -579,20 +579,21 @@ GuessControlValues(void)
 	ControlFile.unloggedLSN = 1;
 
 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
 
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
 	ControlFile.floatFormat = FLOATFORMAT_VALUE;
 	ControlFile.blcksz = BLCKSZ;
 	ControlFile.relseg_size = RELSEG_SIZE;
 	ControlFile.xlog_blcksz = XLOG_BLCKSZ;
 	ControlFile.xlog_seg_size = XLOG_SEG_SIZE;
 	ControlFile.nameDataLen = NAMEDATALEN;
 	ControlFile.indexMaxKeys = INDEX_MAX_KEYS;
@@ -795,20 +796,21 @@ RewriteControlFile(void)
 	 * Force the defaults for max_* settings. The values don't really matter
 	 * as long as wal_level='minimal'; the postmaster will reset these fields
 	 * anyway at startup.
 	 */
 	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
 	ControlFile.wal_log_hints = false;
 	ControlFile.track_commit_timestamp = false;
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
 	ControlFile.xlog_seg_size = XLogSegSize;
 
 	/* Contents are protected with a CRC */
 	INIT_CRC32C(ControlFile.crc);
 	COMP_CRC32C(ControlFile.crc,
 				(char *) &ControlFile,
 				offsetof(ControlFileData, crc));
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 5b88a8d..82c6b51 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -14,20 +14,21 @@
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/hash.h"
 #include "access/heapam_xlog.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
 #include "access/rmgr.h"
 #include "access/spgist.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
+#include "access/fdw_xact.h"
 #include "catalog/storage_xlog.h"
 #include "commands/dbcommands_xlog.h"
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include "replication/origin.h"
 #include "rmgrdesc.h"
 #include "storage/standby.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup) \
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..664de7e
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,73 @@
+/*
+ * fdw_xact.h 
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H 
+#define FDW_XACT_H 
+
+#include "storage/backendid.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serveroid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serveroid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serveroid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index c083216..7272c33 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -37,11 +37,12 @@ PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify,
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL)
 PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL)
 PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup)
 PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index cb1c2db..d614ab6 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -296,20 +296,21 @@ typedef struct xl_xact_parsed_abort
 	RelFileNode *xnodes;
 
 	TransactionId twophase_xid; /* only for 2PC */
 } xl_xact_parsed_abort;
 
 
 /* ----------------
  *		extern definitions
  * ----------------
  */
+#define PG_GETARG_TRANSACTIONID(n)	DatumGetTransactionId(PG_GETARG_DATUM(n))
 extern bool IsTransactionState(void);
 extern bool IsAbortedTransactionBlockState(void);
 extern TransactionId GetTopTransactionId(void);
 extern TransactionId GetTopTransactionIdIfAny(void);
 extern TransactionId GetCurrentTransactionId(void);
 extern TransactionId GetCurrentTransactionIdIfAny(void);
 extern TransactionId GetStableLatestTransactionId(void);
 extern SubTransactionId GetCurrentSubTransactionId(void);
 extern void MarkCurrentTransactionIdLoggedIfAny(void);
 extern bool SubTransactionIsActive(SubTransactionId subxid);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 86b532d..9ce64bd 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -206,20 +206,21 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 
 /*
  * Information logged when we detect a change in one of the parameters
  * important for Hot Standby.
  */
 typedef struct xl_parameter_change
 {
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
 	bool		track_commit_timestamp;
 } xl_parameter_change;
 
 /* logs restore point */
 typedef struct xl_restore_point
 {
 	TimestampTz rp_time;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..d168c32 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -173,20 +173,21 @@ typedef struct ControlFileData
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
 	 * or hot standby.
 	 */
 	int			wal_level;
 	bool		wal_log_hints;
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
 	/*
 	 * This data is used to check for hardware-architecture compatibility of
 	 * the database and the backend executable.  We need not check endianness
 	 * explicitly, since the pg_control version will surely look wrong to a
 	 * machine of different endianness, but we do need to worry about MAXALIGN
 	 * and floating-point format.  (Note: storage layout nominally also
 	 * depends on SHORTALIGN and INTALIGN, but in practice these are the same
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index f688454..00a119a 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5268,20 +5268,26 @@ DESCR("fractional rank of hypothetical row");
 DATA(insert OID = 3989 ( percent_rank_final PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_percent_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3990 ( cume_dist			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s 1 0 701 "2276" "{2276}" "{v}" _null_ _null_ _null_ aggregate_dummy _null_ _null_ _null_ ));
 DESCR("cumulative distribution of hypothetical row");
 DATA(insert OID = 3991 ( cume_dist_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 701 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_cume_dist_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
 DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s 1 0 20 "2276" "{2276}" "{v}" _null_ _null_ _null_	aggregate_dummy _null_ _null_ _null_ ));
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_	hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4066 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4083 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4099 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v s 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3584 ( binary_upgrade_set_next_array_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v s 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_array_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3585 ( binary_upgrade_set_next_toast_pg_type_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v s 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_toast_pg_type_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
 DATA(insert OID = 3586 ( binary_upgrade_set_next_heap_pg_class_oid PGNSP PGUID	12 1 0 0 0 f f f f t f v s 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_heap_pg_class_oid _null_ _null_ _null_ ));
 DESCR("for use by pg_upgrade");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 69b48b4..d1ddb4e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -5,20 +5,21 @@
  *
  * Copyright (c) 2010-2015, PostgreSQL Global Development Group
  *
  * src/include/foreign/fdwapi.h
  *
  *-------------------------------------------------------------------------
  */
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/xact.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
 /* To avoid including explain.h here, reference ExplainState thus: */
 struct ExplainState;
 
 
 /*
  * Callback function signatures --- see fdwhandler.sgml for more info.
  */
@@ -110,20 +111,32 @@ typedef int (*AcquireSampleRowsFunc) (Relation relation, int elevel,
 											   HeapTuple *rows, int targrows,
 												  double *totalrows,
 												  double *totaldeadrows);
 
 typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 												 AcquireSampleRowsFunc *func,
 													BlockNumber *totalpages);
 
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
+typedef bool (*EndForeignTransaction_function) (Oid serverOid, Oid userid,
+													bool is_commit);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverOid, Oid userid,
+														int prep_info_len,
+														char *prep_info);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverOid, Oid userid,
+															bool is_commit,
+														int prep_info_len,
+														char *prep_info);
+typedef char *(*GetPrepareId_function) (Oid serverOid, Oid userid,
+														int *prep_info_len);
+
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
  * planner and executor.
  *
  * More function pointers are likely to be added in the future.  Therefore
  * it's recommended that the handler initialize the struct with
  * makeNode(FdwRoutine) so that all fields are set to NULL.  This will
  * ensure that no fields are accidentally left undefined.
@@ -165,20 +178,26 @@ typedef struct FdwRoutine
 
 	/* Support functions for EXPLAIN */
 	ExplainForeignScan_function ExplainForeignScan;
 	ExplainForeignModify_function ExplainForeignModify;
 
 	/* Support functions for ANALYZE */
 	AnalyzeForeignTable_function AnalyzeForeignTable;
 
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
+
+	/* Support functions for foreign transactions */
+	GetPrepareId_function				GetPrepareId;
+	EndForeignTransaction_function		EndForeignTransaction;
+	PrepareForeignTransaction_function	PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
 } FdwRoutine;
 
 
 /* Functions in foreign/foreign.c */
 extern FdwRoutine *GetFdwRoutine(Oid fdwhandler);
 extern Oid	GetForeignServerIdByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineByServerId(Oid serverid);
 extern FdwRoutine *GetFdwRoutineByRelId(Oid relid);
 extern FdwRoutine *GetFdwRoutineForRelation(Relation relation, bool makecopy);
 extern bool IsImportableForeignTable(const char *tablename,
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 3d68017..7458d5b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -223,25 +223,26 @@ typedef struct PROC_HDR
 } PROC_HDR;
 
 extern PROC_HDR *ProcGlobal;
 
 extern PGPROC *PreparedXactProcs;
 
 /*
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
- * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation. Startup process and WAL receiver also consume 2
+ * slots, but WAL writer is launched only after startup has exited, so we only
+ * need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
 extern int	DeadlockTimeout;
 extern int	StatementTimeout;
 extern int	LockTimeout;
 extern bool log_lock_waits;
 
 
 /*
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index c193e44..56df3a5 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1258,11 +1258,15 @@ extern Datum pg_available_extensions(PG_FUNCTION_ARGS);
 extern Datum pg_available_extension_versions(PG_FUNCTION_ARGS);
 extern Datum pg_extension_update_paths(PG_FUNCTION_ARGS);
 extern Datum pg_extension_config_dump(PG_FUNCTION_ARGS);
 
 /* commands/prepare.c */
 extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 80374e4..e95334d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1305,20 +1305,30 @@ pg_available_extensions| SELECT e.name,
     e.comment
    FROM (pg_available_extensions() e(name, default_version, comment)
      LEFT JOIN pg_extension x ON ((e.name = x.extname)));
 pg_cursors| SELECT c.name,
     c.statement,
     c.is_holdable,
     c.is_binary,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT p.transaction,
+    d.datname AS database,
+    s.srvname AS "foreign server",
+    u.rolname AS "local user",
+    p.status,
+    p.identifier AS "foreign transaction identifier"
+   FROM (((pg_fdw_xact() p(dbid, transaction, serverid, userid, status, identifier)
+     LEFT JOIN pg_authid u ON ((p.userid = u.oid)))
+     LEFT JOIN pg_database d ON ((p.dbid = d.oid)))
+     LEFT JOIN pg_foreign_server s ON ((p.serverid = s.oid)));
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
     a.name,
     a.setting,
     a.applied,
     a.error
    FROM pg_show_all_file_settings() a(sourcefile, sourceline, seqno, name, setting, applied, error);
 pg_group| SELECT pg_authid.rolname AS groname,
     pg_authid.oid AS grosysid,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index dd65ab5..3c23446 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2224,37 +2224,40 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		if (system(buf))
 		{
 			fprintf(stderr, _("\n%s: initdb failed\nExamine %s/log/initdb.log for the reason.\nCommand was: %s\n"), progname, outputdir, buf);
 			exit(2);
 		}
 
 		/*
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions.  (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
 		if (pg_conf == NULL)
 		{
 			fprintf(stderr, _("\n%s: could not open \"%s\" for adding extra config: %s\n"), progname, buf, strerror(errno));
 			exit(2);
 		}
 		fputs("\n# Configuration added by pg_regress\n\n", pg_conf);
 		fputs("log_autovacuum_min_duration = 0\n", pg_conf);
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		if (temp_config != NULL)
 		{
 			FILE	   *extra_conf;
 			char		line_buf[1024];
 
 			extra_conf = fopen(temp_config, "r");
 			if (extra_conf == NULL)
 			{
 				fprintf(stderr, _("\n%s: could not open \"%s\" to read extra config: %s\n"), progname, temp_config, strerror(errno));

#51

Michael Paquier

michael.paquier@gmail.com

about 10 years ago

In reply to: Ashutosh Bapat (#50)

Re: Transactions involving multiple postgres foreign servers

On Mon, Nov 9, 2015 at 8:55 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's updated
patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

Here's updated patch. I didn't use version numbers in file names in my
previous patches. I am starting from this onwards.

Ashutosh, others, this thread has been stalling for more than 1 month
and a half. There is a new patch that still applies (be careful of
whitespaces btw), but no reviews came in. So what should we do? I
would tend to move this patch to the next CF because of a lack of
reviews.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 10 years ago

In reply to: Michael Paquier (#51)

Re: Transactions involving multiple postgres foreign servers

On Thu, Dec 24, 2015 at 8:32 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

On Mon, Nov 9, 2015 at 8:55 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Sat, Nov 7, 2015 at 12:07 AM, Robert Haas <robertmhaas@gmail.com>

wrote:

On Wed, Aug 12, 2015 at 6:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The previous patch would not compile on the latest HEAD. Here's

updated

patch.

Perhaps unsurprisingly, this doesn't apply any more. But we have
bigger things to worry about.

Here's updated patch. I didn't use version numbers in file names in my
previous patches. I am starting from this onwards.

Ashutosh, others, this thread has been stalling for more than 1 month
and a half. There is a new patch that still applies (be careful of
whitespaces btw), but no reviews came in. So what should we do? I
would tend to move this patch to the next CF because of a lack of
reviews.

Yes, that would help. Thanks.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#53

Michael Paquier

michael.paquier@gmail.com

about 10 years ago

In reply to: Ashutosh Bapat (#52)

Re: Transactions involving multiple postgres foreign servers

On Thu, Dec 24, 2015 at 7:03 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Dec 24, 2015 at 8:32 AM, Michael Paquier <michael.paquier@gmail.com>

Ashutosh, others, this thread has been stalling for more than 1 month
and a half. There is a new patch that still applies (be careful of
whitespaces btw), but no reviews came in. So what should we do? I
would tend to move this patch to the next CF because of a lack of
reviews.

Yes, that would help. Thanks.

Done.
--
Michael


#54

Alvaro Herrera

alvherre@2ndquadrant.com

almost 10 years ago

In reply to: Ashutosh Bapat (#50)

Re: Transactions involving multiple postgres foreign servers

Ashutosh Bapat wrote:

Here's updated patch. I didn't use version numbers in file names in my
previous patches. I am starting from this onwards.

Um, I tried this patch and it doesn't apply at all. There's a large
number of conflicts. Please update it and resubmit to the next
commitfest.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#55

Alvaro Herrera

alvherre@2ndquadrant.com

almost 10 years ago

In reply to: Alvaro Herrera (#54)

Re: Transactions involving multiple postgres foreign servers

Alvaro Herrera wrote:

Ashutosh Bapat wrote:

Here's updated patch. I didn't use version numbers in file names in my
previous patches. I am starting from this onwards.

Um, I tried this patch and it doesn't apply at all. There's a large
number of conflicts. Please update it and resubmit to the next
commitfest.

Also, please run "git show --check" or "git diff origin/master --check"
and fix the whitespace problems that it shows. It's an easy thing but
there's a lot of red squares in my screen.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#56

Vinayak Pokale

vinpokale@gmail.com

over 9 years ago

In reply to: Alvaro Herrera (#55)

Re: Transactions involving multiple postgres foreign servers

Hi All,

Ashutosh proposed 2PC for FDW to achieve atomic commits across multiple
foreign servers.
If a transaction makes changes on more than one foreign server, the current
implementation in postgres_fdw doesn't make sure that either all of them
commit or all of them roll back their changes.

We (Masahiko Sawada and I) are reopening this thread and would like to
contribute to it.

2PC for FDW
============
The patch provides support for atomic commit of transactions involving
foreign servers. When a transaction makes changes on foreign servers,
either all the changes on all the foreign servers commit or all of them
roll back.

The new 2PC for FDW patch set includes the following:
1. Patch 0001 introduces a generic feature. Any FDW that supports 2PC,
such as oracle_fdw, mysql_fdw, postgres_fdw etc., can take part in the
transaction.

Currently we can push some conditions down to shard nodes; in particular,
9.6 introduced the direct modify feature. But a transaction modifying data
on shard nodes this way is not guaranteed to commit atomically.
With the 0002 patch, such modifications are executed with 2PC. This means we
can almost provide a sharding solution using multiple PostgreSQL servers
(one parent node and several shard nodes).

For multi-master, we definitely need a transaction manager, but that
transaction manager could probably use this 2PC for FDW feature to manage
distributed transactions.

2. Patch 0002 makes it possible for postgres_fdw to use 2PC.

Patch 0002 makes postgres_fdw use the APIs below. These APIs are generic
and can be used by any FDW.

a. Execute PREPARE TRANSACTION and COMMIT PREPARED/ROLLBACK PREPARED instead
of COMMIT/ABORT on foreign servers which support 2PC.
b. Manage information for the foreign prepared transaction resolver.
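
To make the flow in (a) concrete, here is a minimal standalone sketch of the
two phases. It is not code from the patches: Participant, prepare_ok,
participant_prepare(), participant_finish() and two_phase_commit() are
hypothetical names used only for illustration, and the foreign servers are
simulated in memory rather than contacted over a connection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states of the remote transaction on one foreign server. */
typedef enum
{
	XACT_ACTIVE,
	XACT_PREPARED,
	XACT_COMMITTED,
	XACT_ABORTED
} XactState;

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct
{
	const char *name;		/* illustrative server name */
	bool		prepare_ok;	/* simulates whether PREPARE TRANSACTION succeeds */
	XactState	state;
} Participant;

/* Phase 1: corresponds to sending PREPARE TRANSACTION to the server. */
bool
participant_prepare(Participant *p)
{
	if (!p->prepare_ok)
		return false;
	p->state = XACT_PREPARED;
	return true;
}

/*
 * Phase 2: corresponds to COMMIT PREPARED or ROLLBACK PREPARED for a prepared
 * server, or a plain ROLLBACK for a server that was never prepared.
 */
void
participant_finish(Participant *p, bool commit)
{
	p->state = commit ? XACT_COMMITTED : XACT_ABORTED;
}

/*
 * Prepare every participant; commit all of them only if every prepare
 * succeeded, otherwise roll all of them back.  Returns true on commit.
 */
bool
two_phase_commit(Participant *parts, int n)
{
	bool		all_prepared = true;
	int			i;

	assert(parts != NULL && n >= 0);

	for (i = 0; i < n; i++)
	{
		if (!participant_prepare(&parts[i]))
		{
			all_prepared = false;
			break;			/* no point preparing the remaining servers */
		}
	}

	for (i = 0; i < n; i++)
		participant_finish(&parts[i], all_prepared);

	return all_prepared;
}
```

The sketch leaves out the hard part, which is a connection being lost
between the two phases. That is the case where the information in (b) is
needed to resolve the prepared transaction later.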

Masahiko Sawada will post the patch.

Suggestions and comments are helpful to implement this feature.

Regards,

Vinayak Pokale

On Mon, Feb 1, 2016 at 11:14 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

Alvaro Herrera wrote:

Ashutosh Bapat wrote:

Here's updated patch. I didn't use version numbers in file names in my
previous patches. I am starting from this onwards.

Um, I tried this patch and it doesn't apply at all. There's a large
number of conflicts. Please update it and resubmit to the next
commitfest.

Also, please run "git show --check" or "git diff origin/master --check"
and fix the whitespace problems that it shows. It's an easy thing but
there's a lot of red squares in my screen.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#57

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Vinayak Pokale (#56)

2 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com> wrote:

Hi All,

Ashutosh proposed 2PC for FDW to achieve atomic commits across multiple
foreign servers.
If a transaction makes changes on more than one foreign server, the current
implementation in postgres_fdw doesn't make sure that either all of them
commit or all of them roll back their changes.

We (Masahiko Sawada and I) are reopening this thread and would like to
contribute to it.

2PC for FDW
============
The patch provides support for atomic commit of transactions involving
foreign servers. When a transaction makes changes on foreign servers,
either all the changes on all the foreign servers commit or all of them
roll back.

The new 2PC for FDW patch set includes the following:
1. Patch 0001 introduces a generic feature. Any FDW that supports 2PC,
such as oracle_fdw, mysql_fdw, postgres_fdw etc., can take part in the
transaction.

Currently we can push some conditions down to shard nodes; in particular,
9.6 introduced the direct modify feature. But a transaction modifying data
on shard nodes this way is not guaranteed to commit atomically.
With the 0002 patch, such modifications are executed with 2PC. This means we
can almost provide a sharding solution using multiple PostgreSQL servers
(one parent node and several shard nodes).

For multi-master, we definitely need a transaction manager, but that
transaction manager could probably use this 2PC for FDW feature to manage
distributed transactions.

2. Patch 0002 makes it possible for postgres_fdw to use 2PC.

Patch 0002 makes postgres_fdw use the APIs below. These APIs are generic
and can be used by any FDW.

a. Execute PREPARE TRANSACTION and COMMIT PREPARED/ROLLBACK PREPARED instead
of COMMIT/ABORT on foreign servers which support 2PC.
b. Manage information for the foreign prepared transaction resolver.

Masahiko Sawada will post the patch.

There is still a lot of work to do, but I've attached the latest patches.
They are based on the patch Ashutosh posted before; I revised it and
split it into two patches.
Compared with the original patch, the pg_fdw_xact_resolver part and the
documentation are missing.

Feedback and suggestions are very welcome.

Regards,

--
Masahiko Sawada

Attachments:

0001-Support-transaction-with-foreign-servers.patch (text/plain; charset=US-ASCII)

From 6c438e39f625c9a88b4242fe605deaba43087744 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Mon, 8 Aug 2016 15:04:18 -0700
Subject: [PATCH 1/2] Support transaction with foreign servers.

---
 src/backend/access/rmgrdesc/Makefile          |    9 +-
 src/backend/access/rmgrdesc/fdw_xactdesc.c    |   61 +
 src/backend/access/rmgrdesc/xlogdesc.c        |    5 +-
 src/backend/access/transam/Makefile           |    2 +-
 src/backend/access/transam/fdw_xact.c         | 2034 +++++++++++++++++++++++++
 src/backend/access/transam/rmgr.c             |    1 +
 src/backend/access/transam/twophase.c         |    7 +
 src/backend/access/transam/xact.c             |   14 +
 src/backend/access/transam/xlog.c             |   27 +-
 src/backend/bootstrap/bootstrap.c             |    1 +
 src/backend/commands/foreigncmds.c            |   26 +
 src/backend/replication/logical/decode.c      |    1 +
 src/backend/storage/ipc/ipci.c                |    3 +
 src/backend/storage/lmgr/lwlocknames.txt      |    1 +
 src/backend/utils/misc/guc.c                  |   14 +
 src/backend/utils/misc/postgresql.conf.sample |    6 +
 src/bin/initdb/initdb.c                       |    1 +
 src/bin/pg_controldata/pg_controldata.c       |    2 +
 src/bin/pg_resetxlog/pg_resetxlog.c           |    2 +
 src/bin/pg_xlogdump/rmgrdesc.c                |    2 +
 src/include/access/fdw_xact.h                 |   75 +
 src/include/access/rmgrlist.h                 |    1 +
 src/include/access/xlog_internal.h            |    1 +
 src/include/catalog/pg_control.h              |    1 +
 src/include/catalog/pg_proc.h                 |    6 +
 src/include/foreign/fdwapi.h                  |   24 +
 src/include/storage/proc.h                    |    5 +-
 src/include/utils/builtins.h                  |    4 +
 src/test/regress/pg_regress.c                 |   11 +-
 29 files changed, 2333 insertions(+), 14 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/fdw_xactdesc.c
 create mode 100644 src/backend/access/transam/fdw_xact.c
 create mode 100644 src/include/access/fdw_xact.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..b01ccf8
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 62ed1dc..c2f36c7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..df305e5
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2034 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign server/s.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by oid of foreign server and user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers, which can participate
+ * in two-phase commit protocol. Transaction on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If first phase succeeds, foreign servers are requested to commit respective
+ * prepared transactions. If the first phase  does not succeed because of any
+ * failure, the foreign servers are asked to rollback respective prepared
+ * transactions or abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction leaves
+ * that prepared transaction unresolved. During the first phase, before actually
+ * preparing the transactions, enough information is persisted to the disk and
+ * logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreing_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it before hand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreing_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of a foreign prepared transaction file is the xid, the foreign server
+ * oid and the user oid, each printed as 8 hex digits, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * The function is responsible for pre-commit processing on foreign connections.
+ * The foreign transactions are prepared on the foreign servers which can
+ * execute two-phase-commit protocol. Those will be aborted or committed after
+ * the current transaction has been aborted or committed resp. We try to commit
+ * transactions on rest of the foreign servers now. For these foreign servers
+ * it is possible that some transactions commit even if the local transaction
+ * aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no more part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Prepare the transactions on the foreign servers, which can execute
+	 * two-phase-commit protocol.
+	 */
+	prepare_foreign_transactions();
+}
+
+/*
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+													  fdw_conn->userid,
+													  &fdw_xact_info_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (that way relaying it on standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_info_len,
+									 fdw_xact_info);
+		/*
+		 * Between register_fdw_xact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction (say,
+		 * because of a signal). During abort processing, it will send an ABORT
+		 * message to the foreign server. If the foreign server has not prepared
+		 * the transaction, the message will succeed. If the foreign server has
+		 * prepared transaction, it will throw an error, which we will ignore and the
+		 * prepared foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign server has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serverid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.	 Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.	The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+	UserMapping		*user_mapping;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	user_mapping = GetUserMapping(userid, serverid);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = user_mapping->umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Remove the foreign prepared transaction entry from shared memory and from
+ * disk, and write a WAL record for the removal.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of locked foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove the entry from
+	 * the list of locked foreign transactions, under the LW lock. If we reversed
+	 * the order and the process exited between those two steps, we would be left
+	 * with an entry locked by this backend, which would get unlocked only at
+	 * server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/*
+	 * Lock and fetch all the entries belonging to the given transaction id. If
+	 * the foreign transaction resolver is running, it might have locked some
+	 * entries to check whether they can be resolved. The search function will
+	 * skip such entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+				fdw->fdwname);
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. Check
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by the arguments
+ * that have valid values for their respective datatypes.
+ *
+ * The table below lists the combinations:
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every entry matches, so
+ * the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked, nothing will be
+ * returned in the list but the returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char			*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serverid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serverid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints.
+ * The foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might have
+ * fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serverids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serverids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serverids[nxacts] = fdw_xact->serverid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serverids[cnt], userids[cnt]);
+
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serverids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serverids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xact.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it does not want
+ * them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
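+		/*
+		 * Size the result array for the maximum possible number of entries;
+		 * num_fdw_xacts can grow before search_fdw_xact() takes FDWXactLock.
+		 */
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * max_fdw_xacts);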
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction has neither committed nor aborted, and it is
+				 * not running on any backend. It probably crashed before the
+				 * actual abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up.
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user that prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* fd was already closed above, so there is nothing to close here */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We cannot know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+			/* Add some valid LSN */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 31c5fd1..159f9d9 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 9f55adc..2dd3df4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 23f36ea..e140e71 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -186,6 +187,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction;
+										   only valid for top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1921,6 +1926,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1981,6 +1989,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2138,6 +2149,7 @@ CommitTransaction(void)
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2322,6 +2334,7 @@ PrepareTransaction(void)
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
@@ -2608,6 +2621,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f13f9c1..2735eff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4905,6 +4906,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -5901,6 +5903,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6582,7 +6587,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7192,6 +7200,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7384,6 +7393,12 @@ StartupXLOG(void)
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
@@ -8641,6 +8656,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9051,7 +9071,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9072,6 +9093,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9087,6 +9109,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9275,6 +9298,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9467,6 +9491,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e518e17..25126de 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index eb531af..9a10696 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1087,6 +1088,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1385,6 +1400,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..c0f000c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index c04b17f..74f10b7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -141,6 +142,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -253,6 +255,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f8996cd..6589cfe 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -47,3 +47,4 @@ CommitTsLock						39
 ReplicationOriginLock				40
 MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
+FDWXactLock					43
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c5178f7..ff21090 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -2055,6 +2056,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 6d0666c..8a26264 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index a978bbc..ea4682d 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,6 +208,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 96619a2..90cceb5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -296,5 +296,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 525b82b..c8cf4ce 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 017b9c5..edde3d5 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..87636de
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..86448ff 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,6 +44,7 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 0a595cc..9a92ce7 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 0bc41ab..3413201 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 6fed7a0..1f665a5 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5247,6 +5247,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4111 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..3383651 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,23 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													int prep_info_len, char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +237,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f576f05..f49334b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -251,11 +251,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 40e25c8..d047bd4 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1331,4 +1331,8 @@ extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 574f5b8..7d96d8d 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2233,9 +2233,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2249,7 +2251,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{
-- 
2.8.1
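
The callbacks added to FdwRoutine above imply a prepare-then-resolve flow driven by the core transaction manager. What follows is only a minimal, self-contained sketch of that flow, not the code in fdw_xact.c: DemoParticipant, the demo_* mock callbacks and demo_two_phase_commit() are hypothetical names invented for illustration. The sketch also ignores what the real patch has to handle, such as participants that never reached the prepare step and resolution failures left for a resolver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct DemoParticipant
{
	const char *name;
	bool		(*prepare) (const char *name);				/* phase 1 */
	bool		(*resolve) (const char *name, bool commit);	/* phase 2 */
	bool		prepared;		/* true once phase 1 succeeded here */
} DemoParticipant;

/* Counters updated by the mock resolve callback, for inspection only. */
static int	demo_commits = 0;
static int	demo_rollbacks = 0;

/* Mock prepare callback that always succeeds. */
static bool
demo_prepare_ok(const char *name)
{
	(void) name;
	return true;
}

/* Mock prepare callback that always fails. */
static bool
demo_prepare_fail(const char *name)
{
	(void) name;
	return false;
}

/* Mock resolve callback: only counts commits and rollbacks. */
static bool
demo_resolve(const char *name, bool commit)
{
	(void) name;
	if (commit)
		demo_commits++;
	else
		demo_rollbacks++;
	return true;
}

/*
 * Prepare every participant, then commit all of them only if every prepare
 * succeeded; otherwise roll back the ones that were prepared.
 * Returns true if the transaction committed.
 */
static bool
demo_two_phase_commit(DemoParticipant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	for (i = 0; i < n; i++)
		parts[i].prepared = false;

	/* Phase 1: stop at the first participant that fails to prepare. */
	for (i = 0; i < n && all_prepared; i++)
	{
		parts[i].prepared = parts[i].prepare(parts[i].name);
		all_prepared = parts[i].prepared;
	}

	/* Phase 2: commit everywhere, or roll back whatever was prepared. */
	for (i = 0; i < n; i++)
	{
		if (parts[i].prepared)
			(void) parts[i].resolve(parts[i].name, all_prepared);
	}

	return all_prepared;
}
```

Calling demo_two_phase_commit() on an array whose second entry uses demo_prepare_fail rolls back only the first entry, because later entries are never prepared.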

Attachment: 0002-Support-2PC-for-postgres_fdw.patch (text/plain; charset=US-ASCII)

From e4f9500f41cff6d607abfbf601f7511a7652103b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 23 Aug 2016 16:56:09 -0700
Subject: [PATCH 2/2] Support 2PC for postgres_fdw.

---
 contrib/postgres_fdw/connection.c   | 466 ++++++++++++++++++++++++------------
 contrib/postgres_fdw/option.c       |   5 +-
 contrib/postgres_fdw/postgres_fdw.c |  22 +-
 contrib/postgres_fdw/postgres_fdw.h |  11 +-
 4 files changed, 348 insertions(+), 156 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8ca1c1c..28708ba 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "storage/latch.h"
@@ -63,16 +65,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -85,6 +90,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok if true, indicates that caller can handle connection
+ * error by itself. If false, raise error.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -93,7 +101,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -121,9 +130,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -158,7 +164,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -167,7 +186,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -177,9 +201,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -234,11 +261,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"", server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
@@ -369,15 +399,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether the two phase
+		 * compliance is possible.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -585,158 +622,270 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* Length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+	UserMapping 	*user_mapping = GetUserMapping(userid, serverid);
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = user_mapping->umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+
+		PGconn	*conn = entry->conn;
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -835,3 +984,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 224aed9..6a20c47 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -176,6 +177,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 931bcfd..f585273 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -455,6 +457,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1298,7 +1306,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1679,7 +1687,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2276,7 +2284,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2538,7 +2546,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3398,7 +3406,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3490,7 +3498,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3713,7 +3721,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 67126bc..ae2a40d 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -99,7 +100,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -160,6 +162,13 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid, int prep_info_len,
+											  char *prep_info);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
-- 
2.8.1
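
The identifier built by postgresGetPrepareId() in the patch above has the shape "px_<random>_<serverid>_<userid>" and must stay under the 200-byte limit mentioned in its comment. The following is a small standalone sketch of that formatting step with an explicit truncation check. It is an illustration only: demo_make_prepare_id() and DEMO_GID_MAX_LEN are hypothetical names, the random component is passed in by the caller so the result is reproducible, and the OIDs are formatted as unsigned values, whereas the patch uses %d.

```c
#include <stdio.h>

/* Assumed buffer size, mirroring the 200-byte limit cited in the patch. */
#define DEMO_GID_MAX_LEN 200

/*
 * Format "px_<rnd>_<serverid>_<userid>" into buf.
 * Returns the length excluding the terminating NUL byte, or -1 if the
 * identifier does not fit into buflen bytes (including the NUL).
 */
static int
demo_make_prepare_id(char *buf, size_t buflen, unsigned int rnd,
					 unsigned int serverid, unsigned int userid)
{
	int			n = snprintf(buf, buflen, "px_%u_%u_%u", rnd, serverid, userid);

	if (n < 0 || (size_t) n >= buflen)
		return -1;				/* output error or truncated identifier */
	return n;
}
```

Returning -1 on truncation matters because a silently shortened identifier could collide with one generated for another server or user.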

#58

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#57)

Re: Transactions involving multiple postgres foreign servers

On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com>
wrote:

Hi All,

Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
across multiple foreign servers.
If a transaction makes changes to two or more foreign servers, the current
implementation in postgres_fdw doesn't make sure that either all of them
commit or all of them roll back their changes.

We (Masahiko Sawada and I) have reopened this thread and are trying to
contribute to it.

2PC for FDW
============
The patch provides support for atomic commit for transactions involving
foreign servers. When the transaction makes changes to foreign servers,
either all the changes to all the foreign servers commit or all of them
roll back.

The new patch 2PC for FDW includes the following things:
1. The patch 0001 introduces a generic feature. All kinds of FDWs that
support 2PC, such as oracle_fdw, mysql_fdw, postgres_fdw etc., can take
part in the transaction.

Currently we can push some conditions down to shard nodes; in particular,
9.6 introduced the direct modify feature. But a transaction that modifies
data on shard nodes is not guaranteed to commit atomically.
With the 0002 patch, such modifications are executed with 2PC. This means
we can almost provide a sharding solution using multiple PostgreSQL
servers (one parent node and several shard nodes).

For multi-master, we definitely need a transaction manager, but the
transaction manager can probably use this 2PC for FDW feature to manage
distributed transactions.

2. The 0002 patch makes it possible for postgres_fdw to use 2PC.

The 0002 patch makes postgres_fdw use the APIs below. These APIs are
generic features which can be used by all kinds of FDWs.

a. Execute PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED instead of
COMMIT/ABORT on foreign servers which support 2PC.
b. Manage information of the foreign prepared transaction resolver.

Masahiko Sawada will post the patch.

Thanks Vinayak and Sawada-san for taking this forward and basing your work
on my patch.

Still lot of work to do but attached latest patches.
These are based on the patch Ashutosh posted before, I revised it and
divided into two patches.
Compare with original patch, patch of pg_fdw_xact_resolver and
documentation are lacked.

I am not able to understand the last statement.

Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
documentation that my patches had?

OR do you mean to say that my patches did not have (lacked)
pg_fdw_xact_resolver() and documentation?

OR some combination of those?
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#59

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#58)

Re: Transactions involving multiple postgres foreign servers

On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I am not able to understand the last statement.

Sorry to confuse you.

Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
documentation that my patches had?

Yes.
I'm confirming them that your patches had.

Regards,

--
Masahiko Sawada

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#60

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#59)

Re: Transactions involving multiple postgres foreign servers

On Fri, Aug 26, 2016 at 11:37 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

Do you mean to say that your patches do not have pg_fdw_xact_resolver() and
documentation that my patches had?

Yes.
I'm confirming them that your patches had.

Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve
any transactions which can not be resolved immediately after they were
prepared. There was a comment from Kevin (IIRC) that leaving transactions
unresolved on the foreign server keeps the resources locked on those
servers. That's not a very good situation. And nobody but the initiating
server can resolve those. That functionality is important to make it a
complete 2PC solution. So, please consider it to be included in your first
set of patches.
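
To make the point about unresolved transactions concrete, here is a rough sketch of what one retry pass of such a resolver might look like. It is not taken from either patch set: SketchPreparedXact, its fields and sketch_resolve_pass() are hypothetical names, and the connection state is simulated with a flag. A prepared transaction is resolved only when its server is reachable, and everything else is left for the next pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one transaction left prepared on a foreign server. */
typedef struct
{
	const char *gid;			/* prepared transaction identifier */
	bool		commit;			/* outcome decided by the initiating server */
	bool		reachable;		/* simulated: is the foreign server reachable? */
	bool		resolved;		/* set once the foreign server was told */
	char		outcome;		/* 'C' or 'R' once resolved, 0 before */
} SketchPreparedXact;

/*
 * One resolver pass: try to finish every transaction that is still only
 * prepared. Returns the number that remain unresolved, so the caller
 * knows whether another pass is needed later.
 */
size_t
sketch_resolve_pass(SketchPreparedXact *xacts, size_t n)
{
	size_t		unresolved = 0;

	assert(xacts != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (xacts[i].resolved)
			continue;

		if (!xacts[i].reachable)
		{
			/* Still prepared on the foreign server and holding its locks. */
			unresolved++;
			continue;
		}

		/* Would send COMMIT PREPARED or ROLLBACK PREPARED for this gid. */
		xacts[i].outcome = xacts[i].commit ? 'C' : 'R';
		xacts[i].resolved = true;
	}
	return unresolved;
}
```

A background process would call something like this periodically until it returns zero. Only the initiating server knows the commit decision for each identifier, which is why the retry has to run there.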

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#61

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#60)

Re: Transactions involving multiple postgres foreign servers

On Fri, Aug 26, 2016 at 3:13 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve
any transactions which can not be resolved immediately after they were
prepared. There was a comment from Kevin (IIRC) that leaving transactions
unresolved on the foreign server keeps the resources locked on those
servers. That's not a very good situation. And nobody but the initiating
server can resolve those. That functionality is important to make it a
complete 2PC solution. So, please consider it to be included in your first
set of patches.

Yeah, I know the reason why pg_fdw_xact_resolver is required.
I will add it as a separate patch.

Regards,

--
Masahiko Sawada


#62

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

over 9 years ago

In reply to: Ashutosh Bapat (#60)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On 2016/08/26 15:13, Ashutosh Bapat wrote:

Thanks for the clarification. I had added pg_fdw_xact_resolver() to
resolve any transactions which can not be resolved immediately after
they were prepared. There was a comment from Kevin (IIRC) that leaving
transactions unresolved on the foreign server keeps the resources
locked on those servers. That's not a very good situation. And nobody
but the initiating server can resolve those. That functionality is
important to make it a complete 2PC solution. So, please consider it
to be included in your first set of patches.

The attached patch includes pg_fdw_xact_resolver.

Regards,
Vinayak Pokale
NTT Open Source Software Center

Attachments:

0003-pg-fdw-xact-resolver.patchapplication/x-download; name=0003-pg-fdw-xact-resolver.patchDownload

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..6f587ae
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,364 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2015, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How frequently the resolver daemon checks for unresolved transactions (ms) */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL;
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+			
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+	
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}

#63

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

over 9 years ago

In reply to: vinayak (#62)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On 2016/09/07 10:54, vinayak wrote:

Thanks for the clarification. I had added pg_fdw_xact_resolver() to
resolve any transactions which can not be resolved immediately after
they were prepared. There was a comment from Kevin (IIRC) that
leaving transactions unresolved on the foreign server keeps the
resources locked on those servers. That's not a very good situation.
And nobody but the initiating server can resolve those. That
functionality is important to make it a complete 2PC solution. So,
please consider it to be included in your first set of patches.

The attached patch includes pg_fdw_xact_resolver.

The attached patch includes the documentation.

Regards,
Vinayak Pokale
NTT Open Source Software Center

Attachments:

0001-Support-transaction-with-foreign-servers.patchapplication/x-download; name=0001-Support-transaction-with-foreign-servers.patchDownload

commit 204dc713ef986a531065de5bf67491dbe6cddf97
Author: Vinayak Pokale <vinpokale@gmail.com>
Date:   Fri Sep 23 15:42:55 2016 +0900

    Support transaction with foreign servers.

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd66abc..7bbd2d4 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1418,6 +1418,48 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions can not be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled the transaction involving foreign server/s is
+       guaranteed to commit all or none of the changes to the foreign server/s.
+       The parameter can be set any time during the session. The value of this parameter
+       at the time of committing the transaction is used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0c1db07..0077e6e 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1701,4 +1701,86 @@ GetForeignServerByName(const char *name, bool missing_ok);
 
   </sect1>
 
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to the foreign server while starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</> using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, possibly using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled
+    <productname>PostgreSQL</> employs Two-phase commit protocol to achieve
+    atomic distributed transaction. All the foreign servers registered should
+    support two-phase commit protocol. In Two-phase commit protocol the commit
+    is processed in two phases: prepare phase and commit phase. In prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any of the foreign servers fails to prepare the transaction, the prepare phase fails.
+    In commit phase, all the prepared transactions are committed if prepare
+    phase has succeeded or rolled back if prepare phase fails to prepare
+    transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareInfo</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with that identifier and the action
+    FDW_XACT_PREPARE.
+    </para>
+
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier and the action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be retried
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity cannot be
+    guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on each of the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
  </chapter>
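
The two-phase flow described in the documentation hunk above can be sketched as a small standalone simulation. This is only an illustration under stated assumptions, not code from the patch: Participant, TxState, prepare_participant() and two_phase_commit() are hypothetical names, and the can_prepare flag merely stands in for whether a real foreign server would accept the prepare request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server state; this is not the patch's FDWConnection. */
typedef enum { TX_ACTIVE, TX_PREPARED, TX_COMMITTED, TX_ABORTED } TxState;

typedef struct
{
	const char *name;			/* illustrative server name */
	bool		can_prepare;	/* simulates whether the prepare request succeeds */
	TxState		state;
} Participant;

/* Phase one for a single server: returns true if it is now prepared. */
static bool
prepare_participant(Participant *p)
{
	assert(p->state == TX_ACTIVE);
	if (!p->can_prepare)
		return false;
	p->state = TX_PREPARED;
	return true;
}

/*
 * Run both phases over all participants. Returns true if every participant
 * committed, false if every participant was rolled back.
 */
static bool
two_phase_commit(Participant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	/* Prepare phase: stop at the first server that cannot prepare. */
	for (i = 0; i < n; i++)
	{
		if (!prepare_participant(&parts[i]))
		{
			all_prepared = false;
			break;
		}
	}

	/*
	 * Commit phase: commit everywhere if the prepare phase succeeded,
	 * otherwise roll back prepared and unprepared transactions alike.
	 */
	for (i = 0; i < n; i++)
		parts[i].state = all_prepared ? TX_COMMITTED : TX_ABORTED;

	return all_prepared;
}
```

A caller would fill an array of Participant entries, call two_phase_commit(), and inspect each state afterwards. The sketch leaves out what the patch spends most of its effort on, namely persisting enough information before the prepare step so that a crash or lost connection between the two phases can be resolved later.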
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..b01ccf8
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
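
The record-type dispatch used by fdw_xact_desc() and fdw_xact_identify() above relies on masking off the low bits of the info byte, which PostgreSQL reserves for the WAL machinery (XLR_INFO_MASK is 0x0F in access/xlogrecord.h). A minimal standalone sketch of that masking follows. The DEMO_ values are assumptions for illustration only; the real XLOG_FDW_XACT_INSERT and XLOG_FDW_XACT_REMOVE values live in access/fdw_xact.h, which is not shown in this part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Low four bits of the info byte are reserved; rmgrs use the high four. */
#define XLR_INFO_MASK 0x0F

/* Hypothetical record-type codes, chosen only for this illustration. */
#define DEMO_FDW_XACT_INSERT 0x00
#define DEMO_FDW_XACT_REMOVE 0x10

/* Map an info byte to a record name, ignoring the reserved low bits. */
static const char *
demo_fdw_xact_identify(uint8_t info)
{
	switch (info & ~XLR_INFO_MASK)
	{
		case DEMO_FDW_XACT_INSERT:
			return "NEW FOREIGN TRANSACTION";
		case DEMO_FDW_XACT_REMOVE:
			return "REMOVE FOREIGN TRANSACTION";
	}
	/* Unknown record type */
	return NULL;
}
```

Because the low bits are masked out before the comparison, two info bytes that differ only in those bits resolve to the same record name.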
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 62ed1dc..c2f36c7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..df305e5
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2034 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the oids of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction leaves
+ * that prepared transaction unresolved. During the first phase, before actually
+ * preparing the transactions, enough information is persisted to the disk and
+ * logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreing_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreing_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	XLogRecPtr		fdw_xact_lsn;		/* LSN of the log record for inserting this entry */
+	bool			fdw_xact_valid;		/* Has the entry been completed and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 hex digits of xid, 8 hex digits
+ * of foreign server oid and 8 hex digits of user oid, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+										int fdw_xact_id_len, char *fdw_xact_id,
+										FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * The function is responsible for pre-commit processing on foreign connections.
+ * The foreign transactions are prepared on the foreign servers which can
+ * execute two-phase-commit protocol. Those will be aborted or committed after
+ * the current transaction has been aborted or committed resp. We try to commit
+ * transactions on rest of the foreign servers now. For these foreign servers
+ * it is possible that some transactions commit even if the local transaction
+ * aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns a failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no more part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Prepare the transactions on the foreign servers, which can execute
+	 * two-phase-commit protocol.
+	 */
+	prepare_foreign_transactions();
+}
+
+/*
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+													  fdw_conn->userid,
+													  &fdw_xact_info_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (that way relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared a transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_info_len,
+									 fdw_xact_info);
+		/*
+		 * Between the register_fdw_xact call above and the time this backend hears
+		 * back from the foreign server, the backend may abort the local transaction
+		 * (say, because of a signal). During abort processing, it will send an ABORT
+		 * message to the foreign server. If the foreign server has not prepared
+		 * the transaction, the message will succeed. If the foreign server has
+		 * prepared the transaction, it will throw an error, which we will ignore, and
+		 * the prepared foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_xact_info_len, fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from the foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+	char				path[MAXPGPATH];
+	int					fd;
+	pg_crc32c			fdw_xact_crc;
+	pg_crc32c			bogus_crc;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid,
+									fdw_xact_id_len, fdw_xact_id,
+									FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	FDWXactFilePath(path, fdw_xact->local_xid, fdw_xact->serverid,
+						fdw_xact->userid);
+
+	/* Create the file, but error out if it already exists. */
+	fd = OpenTransientFile(path, O_EXCL | O_CREAT | PG_BINARY | O_WRONLY,
+							S_IRUSR | S_IWUSR);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create foreign transaction state file \"%s\": %m",
+						path)));
+
+	/* Write data to file, and calculate CRC as we pass over it */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, fdw_xact_file_data, data_len);
+	if (write(fd, fdw_xact_file_data, data_len) != data_len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+	}
+
+	FIN_CRC32C(fdw_xact_crc);
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.	 Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.	The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_lsn);
+
+	/* If we crash now WAL replay will fix things */
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+	UserMapping		*user_mapping;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	user_mapping = GetUserMapping(userid, serverid);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = user_mapping->umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_lsn = 0;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of locked foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between, we will be left with an entry
+	 * locked by this backend, which gets unlocked only at the server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreing_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("cannot prepare the transaction because some foreign servers involved in the transaction cannot prepare it")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	/* The status was checked above; commit iff it is COMMITTING_PREPARED */
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | any entry
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria is void (all arguments invalid), every entry matches, so
+ * the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a list. Any entry already locked by another backend is ignored.
+ * If all the qualifying entries are locked by other backends, the list will
+ * be empty but the returned value will still be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char			*rec = XLogRecGetData(record);
+	uint8			info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int				rec_len = XLogRecGetDataLen(record);
+	TransactionId	xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData	*fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char				path[MAXPGPATH];
+		int					fd;
+		pg_crc32c	fdw_xact_crc;
+
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serverid,
+							fdw_xact_data_file->userid);
+		/*
+		 * The file may exist, if it was flushed to disk after creating it. The
+		 * file might have been flushed while it was being crafted, so the
+		 * contents can not be guaranteed to be accurate. Hence truncate and
+		 * rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+								S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+						path)));
+
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		/* Remove the file from the disk. */
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serverid, fdw_remove_xlog->userid,
+								true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints.
+ * The foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might have
+ * fewer files to fsync, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FDWXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	Oid				*serverids;
+	TransactionId	*xids;
+	Oid				*userids;
+	Oid				*dbids;
+	int				nxacts;
+	int				cnt;
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * Collect the file paths which need to be synced. We might sync a file
+	 * again if it lives beyond the checkpoint boundaries. But this case is rare
+	 * and may not involve much I/O.
+	 */
+	xids = (TransactionId *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(TransactionId));
+	serverids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	userids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	dbids = (Oid *) palloc(FDWXactGlobal->num_fdw_xacts * sizeof(Oid));
+	nxacts = 0;
+
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		if (fdw_xact->fdw_xact_valid &&
+			fdw_xact->fdw_xact_lsn <= redo_horizon)
+		{
+			xids[nxacts] = fdw_xact->local_xid;
+			serverids[nxacts] = fdw_xact->serverid;
+			userids[nxacts] = fdw_xact->userid;
+			dbids[nxacts] = fdw_xact->dboid;
+			nxacts++;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	for (cnt = 0; cnt < nxacts; cnt++)
+	{
+		char	path[MAXPGPATH];
+		int		fd;
+
+		FDWXactFilePath(path, xids[cnt], serverids[cnt], userids[cnt]);
+
+		fd = OpenTransientFile(path, O_RDWR | PG_BINARY, 0);
+
+		if (fd < 0)
+		{
+			if (errno == ENOENT)
+			{
+				/* OK if we do not have the entry anymore */
+				if (!fdw_xact_exists(xids[cnt], dbids[cnt], serverids[cnt],
+										userids[cnt]))
+					continue;
+
+				/* Restore errno in case it was changed */
+				errno = ENOENT;
+			}
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not open foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file \"%s\": %m",
+							path)));
+		}
+
+		if (CloseTransientFile(fd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close foreign transaction state file \"%s\": %m",
+							path)));
+	}
+
+	pfree(xids);
+	pfree(serverids);
+	pfree(userids);
+	pfree(dbids);
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xact
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xact.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xact(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign transaction;
+				 * nothing more to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction neither committed nor aborted, and it is not
+				 * running in any backend. It probably crashed before the actual
+				 * abort or commit, so assume it aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering to a point before the transaction id which created
+ *	  the prepared foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		/* The file was already closed above, so only free the buffer here. */
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * ReadFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+ReadFDWXacts(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+			/* Add some valid LSN */
+			fdw_xact->fdw_xact_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
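For readers following the state-file handling above: PrescanFDWXacts() and ReadFDWXacts() recognise a state file by its fixed-length name and recover the local xid, server OID and user OID from it. The standalone sketch below illustrates that naming scheme. It is not part of the patch. The sketch_ names are hypothetical, and the uppercase-hex layout is an assumption inferred from the strspn() character set, since FDWXactFilePath() is not shown in this part of the patch. Unlike the patch, the sketch also checks the sscanf() return value.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: three 8-digit uppercase hex fields joined by '_'. */
#define SKETCH_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)

/* Build "XXXXXXXX_XXXXXXXX_XXXXXXXX" from xid, server OID and user OID. */
static void
sketch_build_name(char *buf, size_t buflen,
                  uint32_t xid, uint32_t serverid, uint32_t userid)
{
    snprintf(buf, buflen, "%08" PRIX32 "_%08" PRIX32 "_%08" PRIX32,
             xid, serverid, userid);
}

/*
 * Return true and fill the three outputs only if the name has exactly the
 * expected length, uses only the expected characters, and all three fields
 * parse. Anything else (stray files, truncated names) is rejected.
 */
static bool
sketch_parse_name(const char *name,
                  uint32_t *xid, uint32_t *serverid, uint32_t *userid)
{
    assert(name != NULL);

    if (strlen(name) != SKETCH_FILE_NAME_LEN ||
        strspn(name, "0123456789ABCDEF_") != SKETCH_FILE_NAME_LEN)
        return false;

    /* Each field is limited to 8 hex digits, so a misplaced '_' fails. */
    return sscanf(name, "%8" SCNx32 "_%8" SCNx32 "_%8" SCNx32,
                  xid, serverid, userid) == 3;
}
```

A directory scan could call sketch_parse_name() on every entry and skip those for which it returns false, rather than trusting the length and character-set test alone.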
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..ad71c0e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604..734ed48 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..4956b3d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -186,6 +187,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction;
+										   only valid for top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1917,6 +1922,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1977,6 +1985,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2134,6 +2145,7 @@ CommitTransaction(void)
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2318,6 +2330,7 @@ PrepareTransaction(void)
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
@@ -2604,6 +2617,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2189c22..0d66d1c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4906,6 +4907,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -5972,6 +5974,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6658,7 +6663,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7274,6 +7282,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7466,6 +7475,12 @@ StartupXLOG(void)
 	RecoverPreparedTransactions();
 
 	/*
+	 * WAL replay must have created the files for prepared foreign transactions.
+	 * Reload the shared-memory foreign transaction state.
+	 */
+	ReadFDWXacts();
+
+	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
 	 */
@@ -8723,6 +8738,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9133,7 +9153,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9154,6 +9175,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9169,6 +9191,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9357,6 +9380,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9549,6 +9573,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
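The xlog.c hunks above feed the result of PrescanPreparedTransactions() through PrescanFDWXacts(), which lowers the oldest active xid to cover any xid found in a state file name and skips (removes) files whose xid is not older than nextXid. The sketch below shows that selection in isolation. It is an illustration, not the patch's code: the sketch_ names are hypothetical, the comparison is the usual modulo-2^32 signed-difference test, and the special handling of permanent xids done by the real TransactionIdPrecedes() is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchXid;

/* Wraparound-aware ordering: a precedes b if the signed difference is < 0. */
static bool
sketch_xid_precedes(SketchXid a, SketchXid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Given the oldest active xid found so far, the next xid to be assigned and
 * the xids taken from state file names, return the oldest xid that must be
 * kept alive. Xids at or beyond next_xid belong to a future timeline and are
 * ignored here (the patch removes the corresponding files).
 */
static SketchXid
sketch_oldest_xid(SketchXid oldest_active, SketchXid next_xid,
                  const SketchXid *file_xids, size_t n)
{
    size_t i;

    assert(file_xids != NULL || n == 0);

    for (i = 0; i < n; i++)
    {
        if (!sketch_xid_precedes(file_xids[i], next_xid))
            continue;           /* future xid, not considered */
        if (sketch_xid_precedes(file_xids[i], oldest_active))
            oldest_active = file_xids[i];
    }
    return oldest_active;
}
```

With file xids 100, 50 and 200, an oldest active xid of 80 and a next xid of 150, the function ignores 200 and returns 50.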
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 3870a4d..fca709d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ada2142..77de39b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -251,6 +251,15 @@ CREATE VIEW pg_prepared_xacts AS
          LEFT JOIN pg_authid U ON P.ownerid = U.oid
          LEFT JOIN pg_database D ON P.dbid = D.oid;
 
+CREATE VIEW pg_fdw_xacts AS
+	SELECT P.transaction, D.datname AS database, S.srvname AS "foreign server",
+			U.rolname AS "local user", P.status,
+			P.identifier AS "foreign transaction identifier"
+	FROM pg_fdw_xact() AS P
+		LEFT JOIN pg_authid U ON P.userid = U.oid
+		LEFT JOIN pg_database D ON P.dbid = D.oid
+		LEFT JOIN pg_foreign_server S ON P.serverid = S.oid;
+
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index eb531af..9a10696 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1087,6 +1088,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1385,6 +1400,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..c0f000c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index c04b17f..74f10b7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -141,6 +142,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -253,6 +255,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f8996cd..6589cfe 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -47,3 +47,4 @@ CommitTsLock						39
 ReplicationOriginLock				40
 MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
+FDWXactLock					43
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ce4eef9..7e055f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -2055,6 +2056,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b1c3aea..dea5a47 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3350e13..d303e43 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -210,6 +210,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 96619a2..90cceb5 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -296,5 +296,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:   %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 525b82b..c8cf4ce 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..d6ff550 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..87636de
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void ReadFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..86448ff 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,6 +44,7 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 0a595cc..9a92ce7 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 0bc41ab..3413201 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index e2d08ba..a2272db 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5261,6 +5261,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xact	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4111 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..3383651 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,23 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													int prep_info_len, char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +237,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f576f05..f49334b 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -251,11 +251,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 2ae212a..aa6f203 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1332,4 +1332,8 @@ extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xact(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 00700f2..57f9e51 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,16 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT p.transaction,
+    d.datname AS database,
+    s.srvname AS "foreign server",
+    u.rolname AS "local user",
+    p.status,
+    p.identifier AS "foreign transaction identifier"
+   FROM (((pg_fdw_xact() p(dbid, transaction, serverid, userid, status, identifier)
+     LEFT JOIN pg_authid u ON ((p.userid = u.oid)))
+     LEFT JOIN pg_database d ON ((p.dbid = d.oid)))
+     LEFT JOIN pg_foreign_server s ON ((p.serverid = s.oid)));
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 14c87c9..a67793e 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2233,9 +2233,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2249,7 +2251,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_checkpoints = on\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

#64

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Ashutosh Bapat (#60)

Re: Transactions involving multiple postgres foreign servers

My original patch added code to manage the files for two-phase
transactions opened by the local server on the remote servers. This
code was mostly inspired by the code in twophase.c which manages the
files for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is to see whether those optimizations are applicable here as
well. Have you considered that?

[1] /messages/by-id/74355FCF-AADC-4E51-850B-47AF59E0B215@postgrespro.ru

On Fri, Aug 26, 2016 at 11:43 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Aug 26, 2016 at 11:37 AM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

On Fri, Aug 26, 2016 at 3:03 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Aug 26, 2016 at 11:22 AM, Masahiko Sawada
<sawada.mshk@gmail.com>
wrote:

On Fri, Aug 26, 2016 at 1:32 PM, Vinayak Pokale <vinpokale@gmail.com>
wrote:

Hi All,

Ashutosh proposed the feature 2PC for FDW for achieving atomic commits
across multiple foreign servers.
If a transaction makes changes to more than two foreign servers, the
current implementation in postgres_fdw doesn't make sure that either
all of them commit or all of them roll back their changes.

We (Masahiko Sawada and I) are reopening this thread and trying to
contribute to it.

2PC for FDW
============
The patch provides support for atomic commit for transactions involving
foreign servers. When the transaction makes changes to foreign servers,
either all the changes to all the foreign servers commit or all of them
roll back.

The new 2PC for FDW patch includes the following things:
1. Patch 0001 introduces a generic feature. All kinds of FDWs that
support 2PC, such as oracle_fdw, mysql_fdw, postgres_fdw etc., can be
involved in the transaction.

Currently we can push some conditions down to shard nodes; in
particular, 9.6 introduced the direct modify feature. But such a
transaction modifying data on a shard node is not executed reliably.
With patch 0002, that modification is executed with 2PC. It means that
we can almost provide a sharding solution using multiple PostgreSQL
servers (one parent node and several shard nodes).

For multi-master, we definitely need a transaction manager, but the
transaction manager can probably use this 2PC for FDW feature to manage
distributed transactions.

2. Patch 0002 makes it possible for postgres_fdw to use 2PC.

Patch 0002 makes postgres_fdw use the APIs below. These APIs are
generic features which can be used by all kinds of FDWs.

a. Execute PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED instead of
COMMIT/ABORT on a foreign server which supports 2PC.
b. Manage information of foreign prepared transactions resolver

Masahiko Sawada will post the patch.

Thanks Vinayak and Sawada-san for taking this forward and basing your
work on my patch.

Still lot of work to do but attached latest patches.
These are based on the patch Ashutosh posted before, I revised it and
divided into two patches.
Compare with original patch, patch of pg_fdw_xact_resolver and
documentation are lacked.

I am not able to understand the last statement.

Sorry to confuse you.

Do you mean to say that your patches do not have pg_fdw_xact_resolver()
and
documentation that my patches had?

Yes.
I'm confirming them that your patches had.

Thanks for the clarification. I had added pg_fdw_xact_resolver() to resolve
any transactions which can not be resolved immediately after they were
prepared. There was a comment from Kevin (IIRC) that leaving transactions
unresolved on the foreign server keeps the resources locked on those
servers. That's not a very good situation. And nobody but the initiating
server can resolve those. That functionality is important to make it a
complete 2PC solution. So, please consider it to be included in your first
set of patches.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#64)

Re: Transactions involving multiple postgres foreign servers

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post a patch incorporating
them.

But what we need to do first is to discuss the design in order to reach
consensus.
Since the current design of this patch is to transparently execute the
DCL of 2PC on foreign servers, it changes a lot of code and is
complicated.
Another approach I have in mind is to push down DCL only to foreign
servers that support the 2PC protocol, which is similar to DML push down.
This approach would be simpler than the current idea and would be easy
for a distributed transaction manager to use.
I think that it would be a good place to start.

I'd like to discuss what the best approach is for transactions
involving foreign servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#66

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#65)

Re: Transactions involving multiple postgres foreign servers

On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post the patch incorporated
these changes.

But what we need to do first is the discussion in order to get consensus.
Since current design of this patch is to transparently execute DCL of
2PC on foreign server, this code changes lot of code and is
complicated.

Can you please elaborate? I am not able to understand what DCL is
involved here. According to [1], examples of DCL are the GRANT and
REVOKE commands.

Another approach I have is to push down DCL to only foreign servers
that support 2PC protocol, which is similar to DML push down.
This approach would be more simpler than current idea and is easy to
use by distributed transaction manager.

Again, can you please elaborate on how that would be different from the
current approach and how it simplifies the code?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#67

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#66)

Re: Transactions involving multiple postgres foreign servers

On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post the patch incorporated
these changes.

But what we need to do first is the discussion in order to get consensus.
Since current design of this patch is to transparently execute DCL of
2PC on foreign server, this code changes lot of code and is
complicated.

Can you please elaborate. I am not able to understand what DCL is
involved here. According to [1], examples of DCL are GRANT and REVOKE
command.

I meant transaction management commands such as PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED.
The web page I referred to might be wrong, sorry.

Another approach I have is to push down DCL to only foreign servers
that support 2PC protocol, which is similar to DML push down.
This approach would be more simpler than current idea and is easy to
use by distributed transaction manager.

Again, can you please elaborate, how that would be different from the
current approach and how does it simplify the code.

The idea is just to push down PREPARE TRANSACTION and COMMIT/ROLLBACK
PREPARED to foreign servers that support 2PC.
With this idea, the client needs to do the following operations when a
foreign server is involved in the transaction.

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED commands are pushed
down to the foreign server.
That is, the client needs to execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED explicitly.

With this idea, I think that we don't need to do the following:

* Providing the prepare id of 2PC.
The current patch adds a new API, prepare_id_provider(), but we can use
the prepare id of 2PC that is used on the parent server.

* Keeping track of the status of foreign servers.
The current patch keeps track of the status of foreign servers involved
in the transaction, but this idea is just to push down transaction
management commands to the foreign server.
So I think that we no longer need to do that.

* Adding the max_prepared_foreign_transactions parameter.
It means that the number of transactions involving foreign servers is
the same as max_prepared_transactions.
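
A minimal sketch of the push-down step, assuming the parent server simply
reuses the client-supplied prepare id when forwarding each command. The
names RemoteXactCmd and build_pushdown_command() are hypothetical and not
part of any posted patch; rejecting ids that contain a quote is a
simplification chosen for the sketch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: the three transaction management commands to push down. */
typedef enum
{
	REMOTE_PREPARE,
	REMOTE_COMMIT_PREPARED,
	REMOTE_ROLLBACK_PREPARED
} RemoteXactCmd;

/*
 * Build the SQL text sent to a foreign server, reusing the prepare id
 * given on the parent server.  Returns false if the id is empty, contains
 * a single quote (not escaped in this sketch), or does not fit in buf.
 */
static bool
build_pushdown_command(char *buf, size_t len, RemoteXactCmd cmd,
					   const char *gid)
{
	static const char *const verbs[] = {
		"PREPARE TRANSACTION", "COMMIT PREPARED", "ROLLBACK PREPARED"
	};
	int			n;

	if (gid[0] == '\0' || strchr(gid, '\'') != NULL)
		return false;

	n = snprintf(buf, len, "%s '%s'", verbs[cmd], gid);
	return n > 0 && (size_t) n < len;
}
```

The sketch only formats the command; it keeps no record of which foreign
servers have been prepared.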

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#68

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#67)

Re: Transactions involving multiple postgres foreign servers

On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post the patch incorporated
these changes.

But what we need to do first is the discussion in order to get consensus.
Since current design of this patch is to transparently execute DCL of
2PC on foreign server, this code changes lot of code and is
complicated.

Can you please elaborate. I am not able to understand what DCL is
involved here. According to [1], examples of DCL are GRANT and REVOKE
command.

I meant transaction management command such as PREPARE TRANSACTION and
COMMIT/ABORT PREPARED command.
The web page I refered might be wrong, sorry.

Another approach I have is to push down DCL to only foreign servers
that support 2PC protocol, which is similar to DML push down.
This approach would be more simpler than current idea and is easy to
use by distributed transaction manager.

Again, can you please elaborate, how that would be different from the
current approach and how does it simplify the code.

The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK
PREPARED to foreign servers that support 2PC.
With this idea, the client need to do following operation when foreign
server is involved with transaction.

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed
down to foreign server.
That is, the client needs to execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED explicitly.

In this idea, I think that we don't need to do followings,

* Providing the prepare id of 2PC.
Current patch adds new API prepare_id_provider() but we can use the
prepare id of 2PC that is used on parent server.

* Keeping track of status of foreign servers.
Current patch keeps track of status of foreign servers involved with
transaction but this idea is just to push down transaction management
command to foreign server.
So I think that we no longer need to do that.

The problem with this approach is the same as the one previously
stated. If the connection between the local and foreign server is lost
between PREPARE and COMMIT, the prepared transaction on the foreign
server remains dangling; nobody other than the local server knows what
to do with it, and the local server has lost track of the prepared
transaction on the foreign server. So, just pushing down those
commands doesn't work.

* Adding max_prepared_foreign_transactions parameter.
It means that the number of transaction involving foreign server is
the same as max_prepared_transactions.

That isn't true exactly. max_prepared_foreign_transactions indicates
how many transactions can be prepared on the foreign server, which in
the method you propose should have a cap of max_prepared_transactions
* number of foreign servers.
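
As a worked example of that cap, a small sketch assuming the worst case
in which every locally prepared transaction has touched every foreign
server. The function name is hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical helper: upper bound on the number of foreign prepared
 * transactions when each of max_prepared_transactions local prepared
 * transactions involves all n_foreign_servers foreign servers.
 */
static int64_t
foreign_prepared_xact_cap(int max_prepared_transactions,
						  int n_foreign_servers)
{
	if (max_prepared_transactions < 0 || n_foreign_servers < 0)
		return 0;
	return (int64_t) max_prepared_transactions * n_foreign_servers;
}
```

For instance, max_prepared_transactions = 3 with 4 foreign servers gives
a cap of 12.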

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#69

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Masahiko Sawada (#67)

Re: Transactions involving multiple postgres foreign servers

On Tue, Sep 27, 2016 at 6:24 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

* Providing the prepare id of 2PC.
Current patch adds new API prepare_id_provider() but we can use the
prepare id of 2PC that is used on parent server.

And we assume that when this is used across many servers there will be
no GID conflict because each server is careful enough to generate
unique strings, say with UUIDs?
--
Michael

#70

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#68)

Re: Transactions involving multiple postgres foreign servers

On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post the patch incorporated
these changes.

But what we need to do first is the discussion in order to get consensus.
Since current design of this patch is to transparently execute DCL of
2PC on foreign server, this code changes lot of code and is
complicated.

Can you please elaborate. I am not able to understand what DCL is
involved here. According to [1], examples of DCL are GRANT and REVOKE
command.

I meant transaction management command such as PREPARE TRANSACTION and
COMMIT/ABORT PREPARED command.
The web page I refered might be wrong, sorry.

Another approach I have is to push down DCL to only foreign servers
that support 2PC protocol, which is similar to DML push down.
This approach would be more simpler than current idea and is easy to
use by distributed transaction manager.

Again, can you please elaborate, how that would be different from the
current approach and how does it simplify the code.

The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK
PREPARED to foreign servers that support 2PC.
With this idea, the client need to do following operation when foreign
server is involved with transaction.

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed
down to foreign server.
That is, the client needs to execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED explicitly.

In this idea, I think that we don't need to do followings,

* Providing the prepare id of 2PC.
Current patch adds new API prepare_id_provider() but we can use the
prepare id of 2PC that is used on parent server.

* Keeping track of status of foreign servers.
Current patch keeps track of status of foreign servers involved with
transaction but this idea is just to push down transaction management
command to foreign server.
So I think that we no longer need to do that.

The problem with this approach is same as one previously stated. If
the connection between local and foreign server is lost between
PREPARE and COMMIT the prepared transaction on the foreign server
remains dangling, none other than the local server knows what to do
with it and the local server has lost track of the prepared
transaction on the foreign server. So, just pushing down those
commands doesn't work.

Yeah, my idea is only a first step.
A mechanism that resolves dangling foreign transactions and a resolver
worker process are still necessary.

* Adding max_prepared_foreign_transactions parameter.
It means that the number of transaction involving foreign server is
the same as max_prepared_transactions.

That isn't true exactly. max_prepared_foreign_transactions indicates
how many transactions can be prepared on the foreign server, which in
the method you propose should have a cap of max_prepared_transactions
* number of foreign servers.

Oh, I understood, thanks.

Considering a sharding solution using postgres_fdw (that is, the parent
postgres server has multiple shard postgres servers), we need to
increase max_prepared_foreign_transactions whenever a new shard server
is added to the cluster, or allocate a large enough value in advance.
But estimating a sufficient max_prepared_foreign_transactions would not
be easy; for example, can we estimate it as (max throughput of the
system) * (the number of foreign servers)?

One new idea I came up with is to embed the transaction id on the parent
server in the global transaction id (gid) that is prepared on the shard
server.
The pg_fdw_resolver worker process then periodically resolves dangling
transactions on the foreign server by comparing the lowest active XID on
the parent server with the XID in the gid used by PREPARE TRANSACTION.

For example, suppose that there are one parent server and one shard
server, and the client executes an update transaction (XID = 100)
involving foreign servers.
In the commit phase, the parent server executes PREPARE TRANSACTION
with a gid containing 100, say 'px_<random
number>_100_<serverid>_<userid>', on the foreign server.
If the shard server crashes before COMMIT PREPARED, transaction 100
becomes a dangling transaction.

But the resolver worker process on the parent server can resolve it
with the following steps.
1. Get the lowest active XID on the parent server (XID = 110).
2. Connect to the foreign server. (Get the foreign server information
from the pg_foreign_server system catalog.)
3. Check if there is a prepared transaction with an XID less than 110.
4. Roll back the dangling transaction found in step 3.
The gid 'px_<random number>_100_<serverid>_<userid>' was prepared on
the foreign server by transaction 100, so roll it back.

With this idea, we need the gid provider API, but the parent server
doesn't need to keep persistent foreign transaction data.
We could also remove max_prepared_foreign_transactions, and fdw_xact.c
would become a simpler implementation.
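
The XID check in steps 1 to 3 can be sketched as follows, assuming the
gid layout from the example, 'px_<random number>_<xid>_<serverid>_<userid>'.
All function names here are hypothetical. The sketch only identifies
candidates; it does not decide whether a candidate should be committed or
rolled back.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Build a gid in the assumed px_<random>_<xid>_<serverid>_<userid> layout. */
static int
build_gid(char *buf, size_t len, uint32_t rnd, uint32_t xid,
		  uint32_t serverid, uint32_t userid)
{
	return snprintf(buf, len, "px_%u_%u_%u_%u",
					(unsigned) rnd, (unsigned) xid,
					(unsigned) serverid, (unsigned) userid);
}

/* Extract the XID field; false if the gid does not match the layout. */
static bool
parse_gid_xid(const char *gid, uint32_t *xid)
{
	unsigned int rnd, x, serverid, userid;
	char		trailing;

	/* Exactly four fields and no trailing characters must be present. */
	if (sscanf(gid, "px_%u_%u_%u_%u%c",
			   &rnd, &x, &serverid, &userid, &trailing) != 4)
		return false;
	*xid = (uint32_t) x;
	return true;
}

/* Modulo-2^32 comparison, so that XID wraparound is tolerated. */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (uint32_t) (a - b) < 0;
}

/* Step 3: is this prepared transaction older than the oldest active XID? */
static bool
gid_is_dangling_candidate(const char *gid, uint32_t oldest_active_xid)
{
	uint32_t	xid;

	return parse_gid_xid(gid, &xid) &&
		xid_precedes(xid, oldest_active_xid);
}
```

The wraparound-aware comparison is a design choice of the sketch; a plain
"less than" would misjudge XIDs once the counter wraps.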

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#71

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#70)

Re: Transactions involving multiple postgres foreign servers

On Wed, Sep 28, 2016 at 10:43 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post the patch incorporated
these changes.

But what we need to do first is the discussion in order to get consensus.
Since current design of this patch is to transparently execute DCL of
2PC on foreign server, this code changes lot of code and is
complicated.

Can you please elaborate. I am not able to understand what DCL is
involved here. According to [1], examples of DCL are GRANT and REVOKE
command.

I meant transaction management command such as PREPARE TRANSACTION and
COMMIT/ABORT PREPARED command.
The web page I refered might be wrong, sorry.

Another approach I have is to push down DCL to only foreign servers
that support 2PC protocol, which is similar to DML push down.
This approach would be more simpler than current idea and is easy to
use by distributed transaction manager.

Again, can you please elaborate, how that would be different from the
current approach and how does it simplify the code.

The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK
PREPARED to foreign servers that support 2PC.
With this idea, the client need to do following operation when foreign
server is involved with transaction.

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED command are pushed
down to foreign server.
That is, the client needs to execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED explicitly.

In this idea, I think that we don't need to do followings,

* Providing the prepare id of 2PC.
Current patch adds new API prepare_id_provider() but we can use the
prepare id of 2PC that is used on parent server.

* Keeping track of status of foreign servers.
Current patch keeps track of status of foreign servers involved with
transaction but this idea is just to push down transaction management
command to foreign server.
So I think that we no longer need to do that.

The problem with this approach is same as one previously stated. If
the connection between local and foreign server is lost between
PREPARE and COMMIT the prepared transaction on the foreign server
remains dangling, none other than the local server knows what to do
with it and the local server has lost track of the prepared
transaction on the foreign server. So, just pushing down those
commands doesn't work.

Yeah, my idea is one of the first step.
Mechanism that resolves the dangling foreign transaction and the
resolver worker process are necessary.

* Adding max_prepared_foreign_transactions parameter.
It means that the number of transaction involving foreign server is
the same as max_prepared_transactions.

That isn't true exactly. max_prepared_foreign_transactions indicates
how many transactions can be prepared on the foreign server, which in
the method you propose should have a cap of max_prepared_transactions
* number of foreign servers.

Oh, I understood, thanks.

Consider sharding solution using postgres_fdw (that is, the parent
postgres server has multiple shard postgres servers), we need to
increase max_prepared_foreign_transactions whenever new shard server
is added to cluster, or to allocate enough size in advance. But the
estimation of enough max_prepared_foreign_transactions would not be
easy, for example can we estimate it by (max throughput of the system)
* (the number of foreign servers)?

One new idea I came up with is that we set transaction id on parent
server to global transaction id (gid) that is prepared on shard
server.
And pg_fdw_resolver worker process periodically resolves the dangling
transaction on foreign server by comparing active lowest XID on parent
server with the XID in gid used by PREPARE TRANSACTION.

For example, suppose that there are one parent server and one shard
server, and the client executes update transaction (XID = 100)
involving foreign servers.
In commit phase, parent server executes PREPARE TRANSACTION command
with gid containing 100, say 'px_<random
number>_100_<serverid>_<userid>', on foreign server.
If the shard server crashed before COMMIT PREPARED, the transaction
100 become danging transaction.

But resolver worker process on parent server can resolve it with
following steps.
1. Get lowest active XID on parent server(XID=110).
2. Connect to foreign server. (Get foreign server information from
pg_foreign_server system catalog.)
3. Check if there is prepared transaction with XID less than 110.
4. Rollback the dangling transaction found at #3 step.
gid 'px_<random number>_100_<serverid>_<userid>' is prepared on
foreign server by transaction 100, rollback it.
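
As a rough illustration of steps 1 to 4 above, the following Python sketch simulates the check the resolver worker would make. It assumes the GID layout from the example ('px_<random number>_<xid>_<serverid>_<userid>'); the function names are hypothetical and not part of any patch, and XID wraparound is ignored for simplicity.

```python
import re
from typing import List, Optional

# Assumed GID layout, taken from the example in the thread:
#   px_<random number>_<xid>_<serverid>_<userid>
GID_PATTERN = re.compile(r"^px_(\d+)_(\d+)_(\d+)_(\d+)$")


def xid_from_gid(gid: str) -> Optional[int]:
    """Return the parent-server XID embedded in a GID, or None if the GID
    does not follow the assumed px_... layout."""
    match = GID_PATTERN.match(gid)
    return int(match.group(2)) if match else None


def find_dangling(prepared_gids: List[str], lowest_active_xid: int) -> List[str]:
    """Step 3: pick prepared transactions whose embedded XID is older than
    the lowest active XID on the parent server (wraparound is ignored)."""
    dangling = []
    for gid in prepared_gids:
        xid = xid_from_gid(gid)
        if xid is not None and xid < lowest_active_xid:
            dangling.append(gid)
    return dangling


def rollback_statements(dangling_gids: List[str]) -> List[str]:
    """Step 4: build the statements the resolver would send to the shard."""
    return [f"ROLLBACK PREPARED '{gid}'" for gid in dangling_gids]
```

With a lowest active XID of 110, a GID carrying XID 100 is selected and one carrying XID 115 is left alone; GIDs that do not match the layout are skipped.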

Why always rollback any dangling transaction? There can be a case that
a foreign server has a dangling transaction which needs to be
committed because the portions of that transaction on the other shards
are committed.

The way gid is crafted, there is no way to check whether the given
prepared transaction was created by the local server or not. Probably
the local server needs to add a unique signature in GID to identify
the transactions prepared by itself. That signature should be
transferred to standby to cope up with the fail-over of local server.
In this idea, one has to keep on polling the foreign server to find
any dangling transactions. In usual scenario, we shouldn't have a
large number of dangling transactions, and thus periodic polling might
be a waste.

In this idea, we need gid provider API but parent server doesn't need
to have persistent foreign transaction data.
Also we could remove max_prepared_foreign_transactions, and fdw_xact.c
would become a simpler implementation.

I agree, but we need to cope with above two problems.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#72

Michael Paquier

michael.paquier@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#71)

Re: Transactions involving multiple postgres foreign servers

On Wed, Sep 28, 2016 at 3:30 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

I agree, but we need to cope with above two problems.

I have marked the patch as returned with feedback per the last output
Ashutosh has provided.
--
Michael


#73

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#71)

Re: Transactions involving multiple postgres foreign servers

On Wed, Sep 28, 2016 at 3:30 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Sep 28, 2016 at 10:43 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Sep 27, 2016 at 9:06 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Sep 27, 2016 at 2:54 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 9:07 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Sep 26, 2016 at 5:25 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Sep 26, 2016 at 7:28 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

My original patch added code to manage the files for 2 phase
transactions opened by the local server on the remote servers. This
code was mostly inspired from the code in twophase.c which manages the
file for prepared transactions. The logic to manage 2PC files has
changed since [1] and has been optimized. One of the things I wanted
to do is see, if those optimizations are applicable here as well. Have
you considered that?

Yeah, we're considering it.
After these changes are committed, we will post a patch incorporating
these changes.

But what we need to do first is the discussion in order to get consensus.
Since the current design of this patch is to transparently execute the DCL of
2PC on the foreign server, it changes a lot of code and is
complicated.

Can you please elaborate. I am not able to understand what DCL is
involved here. According to [1], examples of DCL are GRANT and REVOKE
command.

I meant transaction management command such as PREPARE TRANSACTION and
COMMIT/ABORT PREPARED command.
The web page I referred to might be wrong, sorry.

Another approach I have is to push down DCL to only foreign servers
that support 2PC protocol, which is similar to DML push down.
This approach would be simpler than the current idea and easier for a
distributed transaction manager to use.

Again, can you please elaborate, how that would be different from the
current approach and how does it simplify the code.

The idea is just to push down PREPARE TRANSACTION, COMMIT/ROLLBACK
PREPARED to foreign servers that support 2PC.
With this idea, the client needs to do the following when a foreign
server is involved in the transaction.

BEGIN;
UPDATE parent_table SET ...; -- update including foreign server
PREPARE TRANSACTION 'xact_id';
COMMIT PREPARED 'xact_id';

The above PREPARE TRANSACTION and COMMIT PREPARED commands are pushed
down to the foreign server.
That is, the client needs to execute PREPARE TRANSACTION and
COMMIT/ROLLBACK PREPARED explicitly.

In this idea, I think that we don't need to do the following:

* Providing the prepare id of 2PC.
The current patch adds a new API prepare_id_provider() but we can use the
prepare id of 2PC that is used on the parent server.

* Keeping track of the status of foreign servers.
The current patch keeps track of the status of foreign servers involved in
a transaction, but this idea just pushes down transaction management
commands to the foreign server.
So I think that we no longer need to do that.

The problem with this approach is same as one previously stated. If
the connection between local and foreign server is lost between
PREPARE and COMMIT the prepared transaction on the foreign server
remains dangling, none other than the local server knows what to do
with it and the local server has lost track of the prepared
transaction on the foreign server. So, just pushing down those
commands doesn't work.

Yeah, my idea is only a first step.
A mechanism that resolves dangling foreign transactions, and a
resolver worker process, are still necessary.

* Adding max_prepared_foreign_transactions parameter.
It means that the number of transaction involving foreign server is
the same as max_prepared_transactions.

That isn't true exactly. max_prepared_foreign_transactions indicates
how many transactions can be prepared on the foreign server, which in
the method you propose should have a cap of max_prepared_transactions
* number of foreign servers.
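
A small worked example of that cap, as a sketch only: the function name and the numbers below are illustrative assumptions, not values from the patch.

```python
def foreign_prepared_xact_cap(max_prepared_transactions: int,
                              n_foreign_servers: int) -> int:
    """Upper bound on foreign prepared transactions under the push-down
    proposal: each locally prepared transaction can leave at most one
    prepared transaction on every foreign server."""
    return max_prepared_transactions * n_foreign_servers


# Example: 10 local prepared transactions across 3 shards.
cap = foreign_prepared_xact_cap(10, 3)
```

Adding a fourth shard with the same setting raises the bound from 30 to 40, which is why the estimate has to be revisited whenever a shard is added.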

Oh, I understood, thanks.

Consider sharding solution using postgres_fdw (that is, the parent
postgres server has multiple shard postgres servers), we need to
increase max_prepared_foreign_transactions whenever new shard server
is added to cluster, or to allocate enough size in advance. But the
estimation of enough max_prepared_foreign_transactions would not be
easy, for example can we estimate it by (max throughput of the system)
* (the number of foreign servers)?

One new idea I came up with is that we set transaction id on parent
server to global transaction id (gid) that is prepared on shard
server.
And pg_fdw_resolver worker process periodically resolves the dangling
transaction on foreign server by comparing active lowest XID on parent
server with the XID in gid used by PREPARE TRANSACTION.

For example, suppose that there are one parent server and one shard
server, and the client executes update transaction (XID = 100)
involving foreign servers.
In commit phase, parent server executes PREPARE TRANSACTION command
with gid containing 100, say 'px_<random
number>_100_<serverid>_<userid>', on foreign server.
If the shard server crashes before COMMIT PREPARED, transaction
100 becomes a dangling transaction.

But resolver worker process on parent server can resolve it with
following steps.
1. Get lowest active XID on parent server(XID=110).
2. Connect to foreign server. (Get foreign server information from
pg_foreign_server system catalog.)
3. Check if there is prepared transaction with XID less than 110.
4. Rollback the dangling transaction found at #3 step.
gid 'px_<random number>_100_<serverid>_<userid>' is prepared on
foreign server by transaction 100, rollback it.

Why always rollback any dangling transaction? There can be a case that
a foreign server has a dangling transaction which needs to be
committed because the portions of that transaction on the other shards
are committed.

Right, we can heuristically make a decision whether we do COMMIT or
ABORT on local server.
For example, if COMMIT PREPARED succeeded on at least one foreign
server, the local server returns OK to the client and the other dangling
transactions should be committed later.
We can find out that we should do either commit or abort the dangling
transaction by checking CLOG.
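
A minimal sketch of that decision, assuming the local outcome of the XID has already been looked up (in PostgreSQL this would come from the commit log). The enum and function names are hypothetical, and the sketch does not cover the case where the CLOG segment has been truncated.

```python
from enum import Enum
from typing import Optional


class XactStatus(Enum):
    IN_PROGRESS = "in progress"
    COMMITTED = "committed"
    ABORTED = "aborted"


def resolution_for(gid: str, local_status: XactStatus) -> Optional[str]:
    """Map the local transaction outcome to the statement that resolves the
    dangling prepared transaction on the foreign server."""
    if local_status is XactStatus.COMMITTED:
        return f"COMMIT PREPARED '{gid}'"
    if local_status is XactStatus.ABORTED:
        return f"ROLLBACK PREPARED '{gid}'"
    # Still in progress locally: the outcome is unknown, so do nothing yet.
    return None
```

The point of the sketch is that the foreign leg follows the local outcome rather than being rolled back unconditionally.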

But we need to handle the case where the CLOG file containing XID
necessary for resolving dangling transaction is truncated.
If the user does VACUUM FREEZE just after remote server crashed, it
could be truncated.

The way gid is crafted, there is no way to check whether the given
prepared transaction was created by the local server or not. Probably
the local server needs to add a unique signature in GID to identify
the transactions prepared by itself. That signature should be
transferred to standby to cope up with the fail-over of local server.

Maybe we can use database system identifier in control file.

In this idea, one has to keep on polling the foreign server to find
any dangling transactions. In usual scenario, we shouldn't have a
large number of dangling transactions, and thus periodic polling might
be a waste.

We can optimize it by storing the XID that is resolved heuristically
into the control file or system catalog, for example.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#74

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#73)

Re: Transactions involving multiple postgres foreign servers

Why always rollback any dangling transaction? There can be a case that
a foreign server has a dangling transaction which needs to be
committed because the portions of that transaction on the other shards
are committed.

Right, we can heuristically make a decision whether we do COMMIT or
ABORT on local server.
For example, if COMMIT PREPARED succeeded on at least one foreign
server, the local server return OK to client and the other dangling
transactions should be committed later.
We can find out that we should do either commit or abort the dangling
transaction by checking CLOG.

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

But we need to handle the case where the CLOG file containing XID
necessary for resolving dangling transaction is truncated.
If the user does VACUUM FREEZE just after remote server crashed, it
could be truncated.

Hmm, this needs to be fixed. Even my patch relied on XID to determine
whether the transaction committed or rolled back locally and thus to
decide whether it should be committed or rolled back on all the
foreign servers involved. I think I had taken care of the issue you
have pointed out here. Can you please verify the same?

The way gid is crafted, there is no way to check whether the given
prepared transaction was created by the local server or not. Probably
the local server needs to add a unique signature in GID to identify
the transactions prepared by itself. That signature should be
transferred to standby to cope up with the fail-over of local server.

Maybe we can use database system identifier in control file.

may be.

In this idea, one has to keep on polling the foreign server to find
any dangling transactions. In usual scenario, we shouldn't have a
large number of dangling transactions, and thus periodic polling might
be a waste.

We can optimize it by storing the XID that is resolved heuristically
into the control file or system catalog, for example.

There will be many such XIDs. We don't want to dump so many things in
control file, esp. when that's not control data. System catalog is out
of question since a rollback of local transaction would make those
rows in the system catalog invisible. That's the reason, why I chose
to write the foreign prepared transactions to files rather than a
system catalog.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#75

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 9 years ago

In reply to: Ashutosh Bapat (#74)

Re: Transactions involving multiple postgres foreign servers

Hi,

On 2016/10/04 13:26, Ashutosh Bapat wrote:

Why always rollback any dangling transaction? There can be a case that
a foreign server has a dangling transaction which needs to be
committed because the portions of that transaction on the other shards
are committed.

Right, we can heuristically make a decision whether we do COMMIT or
ABORT on local server.
For example, if COMMIT PREPARED succeeded on at least one foreign
server, the local server return OK to client and the other dangling
transactions should be committed later.
We can find out that we should do either commit or abort the dangling
transaction by checking CLOG.

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

I wonder if Sawada-san is referring to some sort of quorum-based (atomic)
commitment protocol [1, 2], although I agree that that would be an
advanced technique for handling the limitations such as blocking nature of
the basic two-phase commit protocol in case of communication failures,
IOW, meant for better availability rather than correctness.

Thanks,
Amit

[1]: https://en.wikipedia.org/wiki/Quorum_(distributed_computing)#Quorum-based_voting_in_commit_protocols

[2]: http://hub.hku.hk/bitstream/10722/158032/1/Content.pdf


#76

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#74)

On Tue, Oct 4, 2016 at 1:26 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Why always rollback any dangling transaction? There can be a case that
a foreign server has a dangling transaction which needs to be
committed because the portions of that transaction on the other shards
are committed.

Right, we can heuristically make a decision whether we do COMMIT or
ABORT on local server.
For example, if COMMIT PREPARED succeeded on at least one foreign
server, the local server return OK to client and the other dangling
transactions should be committed later.
We can find out that we should do either commit or abort the dangling
transaction by checking CLOG.

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

I meant that we could determine it heuristically only when remote server
crashed in 2nd phase of 2PC.
For example, what does the local server return to the client when no
remote server returns OK to the local server in the 2nd phase of 2PC for
more than statement_timeout seconds? OK or error?

But we need to handle the case where the CLOG file containing XID
necessary for resolving dangling transaction is truncated.
If the user does VACUUM FREEZE just after remote server crashed, it
could be truncated.

Hmm, this needs to be fixed. Even my patch relied on XID to determine
whether the transaction committed or rolled back locally and thus to
decide whether it should be committed or rolled back on all the
foreign servers involved. I think I had taken care of the issue you
have pointed out here. Can you please verify the same?

The way gid is crafted, there is no way to check whether the given
prepared transaction was created by the local server or not. Probably
the local server needs to add a unique signature in GID to identify
the transactions prepared by itself. That signature should be
transferred to standby to cope up with the fail-over of local server.

Maybe we can use database system identifier in control file.

may be.

In this idea, one has to keep on polling the foreign server to find
any dangling transactions. In usual scenario, we shouldn't have a
large number of dangling transactions, and thus periodic polling might
be a waste.

We can optimize it by storing the XID that is resolved heuristically
into the control file or system catalog, for example.

There will be many such XIDs. We don't want to dump so many things in
control file, esp. when that's not control data. System catalog is out
of question since a rollback of local transaction would make those
rows in the system catalog invisible. That's the reason, why I chose
to write the foreign prepared transactions to files rather than a
system catalog.

We can store the lowest in-doubt transaction id (say in-doubt XID) that
needs to be resolved later into control file and the CLOG containing XID
greater than in-doubt XID is never truncated.
We need to try to solve such transaction only when in-doubt XID is not NULL.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#77

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#76)

Re: Transactions involving multiple postgres foreign servers

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

I meant that we could determine it heuristically only when remote server
crashed in 2nd phase of 2PC.
For example, what does the local server returns to client when no one remote
server returns OK to local server in 2nd phase of 2PC for more than
statement_timeout seconds? Ok or error?

The local server doesn't wait for the completion of the second phase
to finish the currently running statement. Once all the foreign
servers have responded to PREPARE request in the first phase, the
local server responds to the client. Am I missing something?

There will be many such XIDs. We don't want to dump so many things in
control file, esp. when that's not control data. System catalog is out
of question since a rollback of local transaction would make those
rows in the system catalog invisible. That's the reason, why I chose
to write the foreign prepared transactions to files rather than a
system catalog.

We can store the lowest in-doubt transaction id (say in-doubt XID) that
needs to be resolved later into control file and the CLOG containing XID
greater than in-doubt XID is never truncated.
We need to try to solve such transaction only when in-doubt XID is not NULL.

IIRC, my patch takes care of this. If the oldest active transaction
happens to be later in the time line than the oldest in-doubt
transaction, it sets oldest active transaction id to that of the
oldest in-doubt transaction.
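
The rule described above can be sketched as follows. This is an illustration only: the function name is hypothetical, and XIDs are compared as plain integers, ignoring wraparound.

```python
from typing import Optional


def clog_truncation_horizon(oldest_active_xid: int,
                            oldest_in_doubt_xid: Optional[int] = None) -> int:
    """Return the XID before which commit-log data may be discarded.

    If an unresolved (in-doubt) foreign transaction is older than the oldest
    active transaction, the horizon is held back to that in-doubt XID so its
    commit status stays available to the resolver.
    """
    if oldest_in_doubt_xid is None:
        return oldest_active_xid
    return min(oldest_active_xid, oldest_in_doubt_xid)
```

For example, with an oldest active XID of 110 and an in-doubt XID of 100, the horizon stays at 100 until the in-doubt transaction is resolved.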

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#78

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 9 years ago

In reply to: Ashutosh Bapat (#77)

Re: Transactions involving multiple postgres foreign servers

On 2016/10/04 16:10, Ashutosh Bapat wrote:

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

I meant that we could determine it heuristically only when remote server
crashed in 2nd phase of 2PC.
For example, what does the local server returns to client when no one remote
server returns OK to local server in 2nd phase of 2PC for more than
statement_timeout seconds? Ok or error?

The local server doesn't wait for the completion of the second phase
to finish the currently running statement. Once all the foreign
servers have responded to PREPARE request in the first phase, the
local server responds to the client. Am I missing something?

PREPARE sent to foreign servers involved in a given transaction is
*transparent* to the user who started the transaction, no? That is, user
just says COMMIT and if it is found that there are multiple servers
involved in the transaction, it must be handled using two-phase commit
protocol *behind the scenes*. So the aforementioned COMMIT should not
return to the client until after the above two-phase commit processing has
finished.

Or are you and Sawada-san talking about the case where the user issued
PREPARE and not COMMIT?

Thanks,
Amit


#79

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Amit Langote (#78)

Re: Transactions involving multiple postgres foreign servers

On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2016/10/04 16:10, Ashutosh Bapat wrote:

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

I meant that we could determine it heuristically only when remote server
crashed in 2nd phase of 2PC.
For example, what does the local server returns to client when no one remote
server returns OK to local server in 2nd phase of 2PC for more than
statement_timeout seconds? Ok or error?

The local server doesn't wait for the completion of the second phase
to finish the currently running statement. Once all the foreign
servers have responded to PREPARE request in the first phase, the
local server responds to the client. Am I missing something?

PREPARE sent to foreign servers involved in a given transaction is
*transparent* to the user who started the transaction, no? That is, user
just says COMMIT and if it is found that there are multiple servers
involved in the transaction, it must be handled using two-phase commit
protocol *behind the scenes*. So the aforementioned COMMIT should not
return to the client until after the above two-phase commit processing has
finished.

No, the COMMIT returns after the first phase. It can not wait for all
the foreign servers to complete their second phase, which can take
quite long (or never) if one of the servers has crashed in between.

Or are you and Sawada-san talking about the case where the user issued
PREPARE and not COMMIT?

I guess, Sawada-san is still talking about the user issued PREPARE.
But my comment is applicable otherwise as well.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#80

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#79)

Re: Transactions involving multiple postgres foreign servers

On Tue, Oct 4, 2016 at 8:29 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Oct 4, 2016 at 1:11 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2016/10/04 16:10, Ashutosh Bapat wrote:

Heuristics can not become the default behavior. A user should be given
an option to choose a heuristic, and he should be aware of the
pitfalls when using this heuristic. I guess, first, we need to get a
solution which ensures that the transaction gets committed on all the
servers or is rolled back on all the foreign servers involved. AFAIR,
my patch did that. Once we have that kind of solution, we can think
about heuristics.

I meant that we could determine it heuristically only when remote server
crashed in 2nd phase of 2PC.
For example, what does the local server returns to client when no one remote
server returns OK to local server in 2nd phase of 2PC for more than
statement_timeout seconds? Ok or error?

The local server doesn't wait for the completion of the second phase
to finish the currently running statement. Once all the foreign
servers have responded to PREPARE request in the first phase, the
local server responds to the client. Am I missing something?

PREPARE sent to foreign servers involved in a given transaction is
*transparent* to the user who started the transaction, no? That is, user
just says COMMIT and if it is found that there are multiple servers
involved in the transaction, it must be handled using two-phase commit
protocol *behind the scenes*. So the aforementioned COMMIT should not
return to the client until after the above two-phase commit processing has
finished.

No, the COMMIT returns after the first phase. It can not wait for all
the foreign servers to complete their second phase

Hm, it sounds like it's same as normal commit (not 2PC).
What's the difference?

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.
Otherwise the next transaction can see data that is not committed yet
on remote server.

, which can take
quite long (or never) if one of the servers has crashed in between.

Or are you and Sawada-san talking about the case where the user issued
PREPARE and not COMMIT?

I guess, Sawada-san is still talking about the user issued PREPARE.
But my comment is applicable otherwise as well.

Yes, I'm considering the case where the local server tries to COMMIT
but a remote server crashes after the local server has completed the 1st
phase (PREPARE) on all the remote servers.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#81

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#80)

Re: Transactions involving multiple postgres foreign servers

No, the COMMIT returns after the first phase. It can not wait for all
the foreign servers to complete their second phase

Hm, it sounds like it's same as normal commit (not 2PC).
What's the difference?

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.

If we do that, the local server may not return to the client at all,
if the foreign server crashes and never comes up. Practically, it may
take much longer to finish a COMMIT, depending upon how long it takes
for the foreign server to reply to a COMMIT message. I don't think
that's desirable.

Otherwise the next transaction can see data that is not committed yet
on remote server.

2PC doesn't guarantee transactional consistency all by itself. It only
guarantees that all legs of a distributed transaction are either all
rolled back or all committed. IOW, it guarantees that a distributed
transaction is not rolled back on some nodes and committed on the
other node.
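
To make the all-or-nothing guarantee concrete, here is a small in-memory simulation of a 2PC coordinator. It is a sketch under simplifying assumptions: the Participant class and its methods are hypothetical stand-ins for foreign servers, and crashes or lost connections between the two phases (the hard part discussed in this thread) are not modelled.

```python
from typing import List


class Participant:
    """Stand-in for one foreign server taking part in the transaction."""

    def __init__(self, name: str, can_prepare: bool = True) -> None:
        self.name = name
        self.can_prepare = can_prepare
        self.state = "active"

    def prepare(self) -> bool:
        if self.can_prepare:
            self.state = "prepared"
            return True
        # A failed PREPARE TRANSACTION aborts the transaction on that server.
        self.state = "aborted"
        return False

    def commit_prepared(self) -> None:
        self.state = "committed"

    def rollback_prepared(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Phase 1: prepare everywhere. Phase 2: commit everywhere, but only if
    every prepare succeeded; otherwise roll everything back."""
    prepared = []
    for p in participants:
        if p.prepare():
            prepared.append(p)
            continue
        for q in prepared:
            q.rollback_prepared()
        for q in participants:
            if q.state == "active":
                q.state = "aborted"  # plain ROLLBACK on servers not yet prepared
        return False
    for p in participants:
        p.commit_prepared()
    return True
```

In both outcomes every participant ends in the same state, which is the only property 2PC provides here; nothing in the sketch makes the result visible atomically to concurrent readers.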

Providing a transactionally consistent view is a very hard problem.
Trying to solve all those problems in a single patch would be very
difficult and the amount of changes required may be really huge. Then
there are many possible consistency definitions when it comes to
consistency of distributed system. I have not seen a consensus on what
kind of consistency model/s we want to support in PostgreSQL. That's
another large debate. We have had previous attempts where people have
tried to complete everything in one go and nothing has been completed
yet.

2PC implementation OR guaranteeing that all the legs of a transaction
commit or roll back, is an essential block of any kind of distributed
transaction manager. So, we should at least support that one, before
attacking further problems.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#82

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#81)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

No, the COMMIT returns after the first phase. It can not wait for all
the foreign servers to complete their second phase

Hm, it sounds like it's same as normal commit (not 2PC).
What's the difference?

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.

If we do that, the local server may not return to the client at all,
if the foreign server crashes and never comes up. Practically, it may
take much longer to finish a COMMIT, depending upon how long it takes
for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves so, please refer to [1].
To prevent the local server from blocking forever due to a communication
failure, we could provide a timeout on the coordinator side or on the
participant side.

Otherwise the next transaction can see data that is not committed yet
on remote server.

2PC doesn't guarantee transactional consistency all by itself. It only
guarantees that all legs of a distributed transaction are either all
rolled back or all committed. IOW, it guarantees that a distributed
transaction is not rolled back on some nodes and committed on the
other node.
Providing a transactionally consistent view is a very hard problem.
Trying to solve all those problems in a single patch would be very
difficult and the amount of changes required may be really huge. Then
there are many possible consistency definitions when it comes to
consistency of distributed system. I have not seen a consensus on what
kind of consistency model/s we want to support in PostgreSQL. That's
another large debate. We have had previous attempts where people have
tried to complete everything in one go and nothing has been completed
yet.

Yes, providing atomic visibility is a hard problem, and it's a
separate issue [2].

2PC implementation OR guaranteeing that all the legs of a transaction
commit or roll back, is an essential block of any kind of distributed
transaction manager. So, we should at least support that one, before
attacking further problems.

I agree.

[1]: https://en.wikipedia.org/wiki/Two-phase_commit_protocol
[2]: http://www.bailis.org/papers/ramp-sigmod2014.pdf

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#83

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#82)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

No, the COMMIT returns after the first phase. It can not wait for all
the foreign servers to complete their second phase

Hm, it sounds like it's same as normal commit (not 2PC).
What's the difference?

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.

If we do that, the local server may not return to the client at all,
if the foreign server crashes and never comes up. Practically, it may
take much longer to finish a COMMIT, depending upon how long it takes
for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves that way; please refer to [1].
To prevent the local server from stopping forever due to a communication
failure, we could provide a timeout on the coordinator side or on the
participant side.

This too, looks like a heuristic and shouldn't be the default
behaviour and hence not part of the first version of this feature.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#84

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 9 years ago

In reply to: Ashutosh Bapat (#83)

Re: Transactions involving multiple postgres foreign servers

On 2016/10/06 17:45, Ashutosh Bapat wrote:

On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.

If we do that, the local server may not return to the client at all,
if the foreign server crashes and never comes up. Practically, it may
take much longer to finish a COMMIT, depending upon how long it takes
for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves that way; please refer to [1].
To prevent the local server from stopping forever due to a communication
failure, we could provide a timeout on the coordinator side or on the
participant side.

This too, looks like a heuristic and shouldn't be the default
behaviour and hence not part of the first version of this feature.

At any rate, the coordinator should not return to the client until after
the 2nd phase is completed, which was the original point. If COMMIT
taking longer is an issue, then it could be handled with one of the
approaches mentioned so far (even if not in the first version), but no
version of this feature should really return COMMIT to the client only
after finishing the first phase. Am I missing something?

I am saying this because I am assuming that this feature means the client
itself does not invoke 2PC, even knowing that there are multiple servers
involved, but rather relies on the involved FDW drivers and related core
code to handle it transparently. I may have misunderstood the feature
though, apologies if so.

Thanks,
Amit


#85

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Amit Langote (#84)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 6, 2016 at 2:52 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2016/10/06 17:45, Ashutosh Bapat wrote:

On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.

If we do that, the local server may not return to the client at all,
if the foreign server crashes and never comes up. Practically, it may
take much longer to finish a COMMIT, depending upon how long it takes
for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves that way; please refer to [1].
To prevent the local server from stopping forever due to a communication
failure, we could provide a timeout on the coordinator side or on the
participant side.

This too, looks like a heuristic and shouldn't be the default
behaviour and hence not part of the first version of this feature.

At any rate, the coordinator should not return to the client until after
the 2nd phase is completed, which was the original point. If COMMIT
taking longer is an issue, then it could be handled with one of the
approaches mentioned so far (even if not in the first version), but no
version of this feature should really return COMMIT to the client only
after finishing the first phase. Am I missing something?

There is a small time window between the actual COMMIT and the commit
message being returned. The actual commit happens when we insert a WAL
record saying transaction X committed, and then we return to the client
saying a COMMIT happened. Note that a transaction may be committed but we
will never return to the client with a commit message, because the
connection was lost or the server crashed. I hope we agree on this.

COMMITTING the foreign prepared transactions happens after we COMMIT
the local transaction. If we do it before COMMITTING the local transaction
and the local server crashes, we will roll back the local transaction
during subsequent recovery while the foreign segments have committed,
resulting in an inconsistent state.

If we are successful in COMMITTING the foreign transactions during the
post-commit phase, the COMMIT message will be returned after we have
committed all foreign transactions. But if we cannot reach a foreign
server and the request times out, we cannot revert our decision to commit
the transaction. That's my answer to the timeout-based heuristic.
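
The ordering described above can be sketched roughly as follows. This is only
an illustration in Python, not code from the patch: the Participant class, the
fail_commit flag and commit_distributed() are hypothetical names, and a list
stands in for the local WAL.

```python
class Participant:
    """Hypothetical stand-in for one foreign server (not a postgres_fdw API)."""

    def __init__(self, name, fail_commit=False):
        self.name = name
        self.fail_commit = fail_commit  # simulate losing the server after PREPARE
        self.state = "active"

    def prepare(self):
        self.state = "prepared"

    def commit_prepared(self):
        if self.fail_commit:
            raise ConnectionError(self.name)
        self.state = "committed"


def commit_distributed(local_wal, participants):
    # Phase 1: every foreign server must prepare before the local commit.
    for p in participants:
        p.prepare()

    # Commit point: once the local commit record is written, the decision
    # is final, whatever happens to the foreign servers afterwards.
    local_wal.append("COMMIT")

    # Phase 2 (post-commit): failures here cannot undo the decision; the
    # affected servers keep a prepared transaction to be resolved later.
    unresolved = []
    for p in participants:
        try:
            p.commit_prepared()
        except ConnectionError:
            unresolved.append(p.name)
    return unresolved
```

A server that cannot be reached in the second phase stays in the "prepared"
state while the local commit record is already in place, which is why a
timeout cannot turn the outcome back into a rollback.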

I don't see much point in holding up post-commit processing for a
non-responsive foreign server, which may not respond for days
together. Can you please elaborate a use case? Which commercial
transaction manager does that?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#86

Masahiko Sawada

sawada.mshk@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#85)

Re: Transactions involving multiple postgres foreign servers

On Fri, Oct 7, 2016 at 4:25 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 6, 2016 at 2:52 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2016/10/06 17:45, Ashutosh Bapat wrote:

On Thu, Oct 6, 2016 at 1:34 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Oct 6, 2016 at 1:41 PM, Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

My understanding is that basically the local server can not return
COMMIT to the client until 2nd phase is completed.

If we do that, the local server may not return to the client at all,
if the foreign server crashes and never comes up. Practically, it may
take much longer to finish a COMMIT, depending upon how long it takes
for the foreign server to reply to a COMMIT message.

Yes, I think 2PC behaves that way; please refer to [1].
To prevent the local server from stopping forever due to a communication
failure, we could provide a timeout on the coordinator side or on the
participant side.

This too, looks like a heuristic and shouldn't be the default
behaviour and hence not part of the first version of this feature.

At any rate, the coordinator should not return to the client until after
the 2nd phase is completed, which was the original point. If COMMIT
taking longer is an issue, then it could be handled with one of the
approaches mentioned so far (even if not in the first version), but no
version of this feature should really return COMMIT to the client only
after finishing the first phase. Am I missing something?

There is a small time window between the actual COMMIT and the commit
message being returned. The actual commit happens when we insert a WAL
record saying transaction X committed, and then we return to the client
saying a COMMIT happened. Note that a transaction may be committed but we
will never return to the client with a commit message, because the
connection was lost or the server crashed. I hope we agree on this.

Agree.

COMMITTING the foreign prepared transactions happens after we COMMIT
the local transaction. If we do it before COMMITTING the local transaction
and the local server crashes, we will roll back the local transaction
during subsequent recovery while the foreign segments have committed,
resulting in an inconsistent state.

If we are successful in COMMITTING the foreign transactions during the
post-commit phase, the COMMIT message will be returned after we have
committed all foreign transactions. But if we cannot reach a foreign
server and the request times out, we cannot revert our decision to commit
the transaction. That's my answer to the timeout-based heuristic.

IIUC, 2PC is a protocol that assumes that all of the foreign servers are alive.
If we cannot reach a foreign server during the post-commit phase,
the transaction and the following transactions should basically stop until
the crashed server is revived. This is the first thing to implement in 2PC
for FDW, I think. The heuristic determination approach I mentioned
is one optimization idea to avoid holding up transactions when
a foreign server has crashed.

I don't see much point in holding up post-commit processing for a
non-responsive foreign server, which may not respond for days
together. Can you please elaborate a use case? Which commercial
transaction manager does that?

For example, the client updates data on a foreign server and then
commits. The next transaction from the same client then selects the
data that was updated by the previous transaction. In this case, because
the first transaction is committed, the second transaction should be
able to see the updated data, but with your idea it can see old data.
Since there is obviously an order between the first transaction and the
second transaction, I think this is not a problem of providing a
consistent view.

I guess the transaction manager of Postgres-XC behaves that way, no?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#87

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Masahiko Sawada (#86)

Re: Transactions involving multiple postgres foreign servers

If we are successful in COMMITTING the foreign transactions during the
post-commit phase, the COMMIT message will be returned after we have
committed all foreign transactions. But if we cannot reach a foreign
server and the request times out, we cannot revert our decision to commit
the transaction. That's my answer to the timeout-based heuristic.

IIUC, 2PC is a protocol that assumes that all of the foreign servers are alive.

Do you have any references? Take a look at [1]. The first paragraph
itself mentions that 2PC can achieve its goals despite temporary
failures.

If we cannot reach a foreign server during the post-commit phase,
the transaction and the following transactions should basically stop until
the crashed server is revived.

I have repeatedly given reasons why this is not correct. You and Amit
seem to repeat this statement again and again in turns without giving
any concrete reasons about why this is so.

This is the first thing to implement in 2PC
for FDW, I think. The heuristic determination approach I mentioned
is one optimization idea to avoid holding up transactions when
a foreign server has crashed.

I don't see much point in holding up post-commit processing for a
non-responsive foreign server, which may not respond for days
together. Can you please elaborate a use case? Which commercial
transaction manager does that?

For example, the client updates data on a foreign server and then
commits. The next transaction from the same client then selects the
data that was updated by the previous transaction. In this case, because
the first transaction is committed, the second transaction should be
able to see the updated data, but with your idea it can see old data.
Since there is obviously an order between the first transaction and the
second transaction, I think this is not a problem of providing a
consistent view.

2PC doesn't guarantee this. For that you need other methods and
protocols. We have discussed this before. [2]

[1]: https://en.wikipedia.org/wiki/Two-phase_commit_protocol
[2]: /messages/by-id/CAD21AoCTe1CFfA9g1uqETvLaJZfFH6QoPSDf-L3KZQ-CDZ7q8g@mail.gmail.com
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#88

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

about 9 years ago

In reply to: Ashutosh Bapat (#87)

Re: Transactions involving multiple postgres foreign servers

On 2016/10/13 19:37, Ashutosh Bapat wrote:

If we cannot reach a foreign server during the post-commit phase,
the transaction and the following transactions should basically stop until
the crashed server is revived.

I have repeatedly given reasons why this is not correct. You and Amit
seem to repeat this statement again and again in turns without giving
any concrete reasons about why this is so.

As mentioned in the description of the "Commit" or "Completion" phase in the
Wikipedia article [1]:

* Success

If the coordinator received an agreement message from all cohorts during
the commit-request phase:

1. The coordinator sends a commit message to all the cohorts.

2. Each cohort completes the operation, and releases all the locks and
resources held during the transaction.

3. Each cohort sends an acknowledgment to the coordinator.

4. The coordinator completes the transaction when all acknowledgments
have been received.

* Failure

If any cohort votes No during the commit-request phase (or the
coordinator's timeout expires):

1. The coordinator sends a rollback message to all the cohorts.

2. Each cohort undoes the transaction using the undo log, and releases
the resources and locks held during the transaction.

3. Each cohort sends an acknowledgement to the coordinator.

4. The coordinator undoes the transaction when all acknowledgements have
been received.

In point 4 of both commit and abort cases above, it's been said, "when
*all* acknowledgements have been received."

However, when I briefly read the description in "Transaction Management in
the R* Distributed Database Management System (C. Mohan et al)" [2], it
seems that what Ashutosh is saying might be a correct way to proceed after
all:

"""
2. THE TWO-PHASE COMMIT PROTOCOL

...

After the coordinator receives the votes from all its subordinates, it
initiates the second phase of the protocol. If all the votes were YES
VOTES, then the coordinator moves to the committing state by force-writing
a commit record and sending COMMIT messages to all the subordinates. The
completion of the force-write takes the transaction to its commit point.
Once this point is passed the user can be told that the transaction has
been committed.
...

"""

Sorry about the noise.

Thanks,
Amit

[1]: https://en.wikipedia.org/wiki/Two-phase_commit_protocol#Commit_phase

[2]: http://www.cs.cmu.edu/~natassa/courses/15-823/F02/papers/p378-mohan.pdf


#89

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#87)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 13, 2016 at 7:37 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

If we are successful in COMMITTING the foreign transactions during the
post-commit phase, the COMMIT message will be returned after we have
committed all foreign transactions. But if we cannot reach a foreign
server and the request times out, we cannot revert our decision to commit
the transaction. That's my answer to the timeout-based heuristic.

IIUC, 2PC is a protocol that assumes that all of the foreign servers are alive.

Do you have any references? Take a look at [1]. The first paragraph
itself mentions that 2PC can achieve its goals despite temporary
failures.

I guess it doesn't say that 2PC achieves them by ignoring temporary failures.
2PC can also achieve its goals by waiting for the crashed server to revive.

If we cannot reach a foreign server during the post-commit phase,
the transaction and the following transactions should basically stop until
the crashed server is revived.

I have repeatedly given reasons why this is not correct. You and Amit
seem to repeat this statement again and again in turns without giving
any concrete reasons about why this is so.

This is the first thing to implement in 2PC
for FDW, I think. The heuristic determination approach I mentioned
is one optimization idea to avoid holding up transactions when
a foreign server has crashed.

I don't see much point in holding up post-commit processing for a
non-responsive foreign server, which may not respond for days
together. Can you please elaborate a use case? Which commercial
transaction manager does that?

For example, the client updates data on a foreign server and then
commits. The next transaction from the same client then selects the
data that was updated by the previous transaction. In this case, because
the first transaction is committed, the second transaction should be
able to see the updated data, but with your idea it can see old data.
Since there is obviously an order between the first transaction and the
second transaction, I think this is not a problem of providing a
consistent view.

2PC doesn't guarantee this. For that you need other methods and
protocols. We have discussed this before. [2]

At any rate, I think it would confuse the user if there is no
guarantee that the latest data updated by the previous transaction can be
seen by the following transaction. I don't think that guarantee is worth
sacrificing in order to get better performance.
Atomic visibility for concurrent transactions could be supported later.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#90

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Langote (#88)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

However, when I briefly read the description in "Transaction Management in
the R* Distributed Database Management System (C. Mohan et al)" [2], it
seems that what Ashutosh is saying might be a correct way to proceed after
all:

I think Ashutosh is mostly right, but I think there's a lot of room to
doubt whether the design of this patch is good enough that we should
adopt it.

Consider two possible designs. In design #1, the leader performs the
commit locally and then tries to send COMMIT PREPARED to every standby
server afterward, and only then acknowledges the commit to the client.
In design #2, the leader performs the commit locally and then
acknowledges the commit to the client at once, leaving the task of
running COMMIT PREPARED to some background process. Design #2
involves a race condition, because it's possible that the background
process might not complete COMMIT PREPARED on every node before the
user submits the next query, and that query might then fail to see
supposedly-committed changes. This can't happen in design #1. On the
other hand, there's always the possibility that the leader's session
is forcibly killed, even perhaps by pulling the plug. If the
background process contemplated by design #2 is well-designed, it can
recover and finish sending COMMIT PREPARED to each relevant server
after the next restart. In design #1, that background process doesn't
necessarily exist, so inevitably there is a possibility of orphaning
prepared transactions on the remote servers, which is not good. Even
if the DBA notices them, it won't be easy to figure out whether to
commit them or roll them back.
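
To make the difference concrete, here is a toy model of the two designs. It
is a sketch under simplifying assumptions, not anything from the patch: the
Remote class, commit_design_1(), commit_design_2() and run_background() are
made-up names, and the "background process" is just a function called later.

```python
class Remote:
    """Hypothetical foreign server holding one visible value."""

    def __init__(self):
        self.visible = "old"
        self.prepared = None

    def prepare(self, value):
        self.prepared = value

    def commit_prepared(self):
        self.visible, self.prepared = self.prepared, None

    def read(self):
        return self.visible


def commit_design_1(remote, value):
    remote.prepare(value)
    # ... the local commit happens here ...
    remote.commit_prepared()   # second phase runs before the acknowledgement
    return "COMMIT"


def commit_design_2(remote, value, pending):
    remote.prepare(value)
    # ... the local commit happens here ...
    pending.append(remote)     # second phase is left to a background process
    return "COMMIT"


def run_background(pending):
    # Stand-in for the background process finishing COMMIT PREPARED.
    while pending:
        pending.pop().commit_prepared()
```

With design #2, a read issued between the acknowledgement and
run_background() still returns "old", which is the race described above.
With design #1 that window does not exist, but nothing in this model
finishes the job if the session dies between prepare() and commit_prepared().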

I think this thought experiment shows that, on the one hand, there is
a point to waiting for commits on the foreign servers, because it can
avoid the anomaly of not seeing the effects of your own commits. On
the other hand, it's ridiculous to suppose that every case can be
handled by waiting, because that just isn't true. You can't be sure
that you'll be able to wait long enough for COMMIT PREPARED to
complete, and even if that works out, you may not want to wait
indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has
no value whatsoever unless the system design is such that failing to
wait for it results in the ROLLBACK PREPARED never getting performed
-- which is a pretty poor excuse.

Moreover, there are good reasons to think that doing this kind of
cleanup work in the post-commit hooks is never going to be acceptable.
Generally, the post-commit hooks need to be no-fail, because it's too
late to throw an ERROR. But there's very little hope that a
connection to a remote server can be no-fail; anything that involves a
network connection is, by definition, prone to failure. We can try to
guarantee that every single bit of code that runs in the path that
sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
ERROR, but that's going to be quite difficult to do: even palloc() can
throw an error. And what about interrupts? We don't want to be stuck
inside this code for a long time without any hope of the user
recovering control of the session by pressing ^C, but of course the
way that works is it throws an ERROR, which we can't handle here. We
fixed a similar issue for synchronous replication in
9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
WARNING in that case and then return control to the user. But if we
do that here, all of the code that every FDW emits has to be aware of
that rule and follow it, and it just adds to the list of ways that the
user backend can escape this code without having cleaned up all of the
prepared transactions on the remote side.

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation. The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.
It attempts to complete those and retries periodically. When a new
transaction needs this type of coordination, it adds the necessary
crash-proof state and then signals the background worker. If
appropriate, it can wait for the background worker to complete, just
like a CHECKPOINT waits for the checkpointer to finish -- but if the
CHECKPOINT command is interrupted, the actual checkpoint is
unaffected.
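
As a rough illustration of that idea, the following sketch keeps the pending
operations in a file that survives restarts and retries them in passes. All
names here (PendingStore, resolve_once, the JSON file layout, the send
callback) are assumptions made for the example; a real implementation would
live in the server and use its own crash-safe storage.

```python
import json
import os


class PendingStore:
    """Hypothetical crash-proof state: a JSON file replaced atomically."""

    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def save(self, entries):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(entries, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)  # atomic rename over the old file

    def add(self, server, action, gid):
        entries = self.load()
        entries.append({"server": server, "action": action, "gid": gid})
        self.save(entries)


def resolve_once(store, send):
    """One worker pass; send(server, action, gid) raises ConnectionError
    when the server is unreachable. Returns how many entries remain."""
    remaining = []
    for entry in store.load():
        try:
            send(entry["server"], entry["action"], entry["gid"])
        except ConnectionError:
            remaining.append(entry)  # keep it for the next retry
    store.save(remaining)
    return len(remaining)
```

A backend would call add() before its commit point and then signal the
worker; the worker calls resolve_once() at startup and periodically after
that, so an entry left over from a crash or an unreachable server is picked
up again on a later pass.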

More broadly, the question has been raised as to whether it's right to
try to handle atomic commit and atomic visibility as two separate
problems. The XTM API proposed by Postgres Pro aims to address both
with a single stroke. I don't think that API was well-designed, but
maybe the idea is good even if the code is not. Generally, there are
two ways in which you could imagine that a distributed version of
PostgreSQL might work. One possibility is that one node makes
everything work by going around and giving instructions to the other
nodes, which are more or less unaware that they are part of a cluster.
That is basically the design of Postgres-XC and certainly the design
being proposed here. The other possibility is that the nodes are
actually clustered in some way and agree on things like whether a
transaction committed or what snapshot is current using some kind of
consensus protocol. It is obviously possible to get a fairly long way
using the first approach but it seems likely that the second one is
fundamentally more powerful: among other things, because the first
approach is so centralized, the leader is apt to become a bottleneck.
And, quite apart from that, can a centralized architecture with the
leader manipulating the other workers ever allow for atomic
visibility? If atomic visibility can build on top of atomic commit,
then it makes sense to do atomic commit first, but if we build this
infrastructure and then find that we need an altogether different
solution for atomic visibility, that will be unfortunate.

I know I was one of the people initially advocating this approach, but
I'm no longer convinced that it's going to work out well. I don't
mean that we should abandon all work on this topic, or even less all
discussion, but I think we should be careful not to get so sucked into
the details of perfecting this particular patch that we ignore the
bigger design questions here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#91

Bruce Momjian

bruce@momjian.us

about 9 years ago

In reply to: Robert Haas (#90)

Re: Transactions involving multiple postgres foreign servers

On Wed, Oct 19, 2016 at 11:47:25AM -0400, Robert Haas wrote:

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation. The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.

Yes, you really need both commit on foreign servers before acknowledging
commit to the client, and a background process to clean things up from
an abandoned server.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


#92

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Robert Haas (#90)

Re: Transactions involving multiple postgres foreign servers

On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

However, when I briefly read the description in "Transaction Management in
the R* Distributed Database Management System (C. Mohan et al)" [2], it
seems that what Ashutosh is saying might be a correct way to proceed after
all:

I think Ashutosh is mostly right, but I think there's a lot of room to
doubt whether the design of this patch is good enough that we should
adopt it.

Consider two possible designs. In design #1, the leader performs the
commit locally and then tries to send COMMIT PREPARED to every standby
server afterward, and only then acknowledges the commit to the client.
In design #2, the leader performs the commit locally and then
acknowledges the commit to the client at once, leaving the task of
running COMMIT PREPARED to some background process. Design #2
involves a race condition, because it's possible that the background
process might not complete COMMIT PREPARED on every node before the
user submits the next query, and that query might then fail to see
supposedly-committed changes. This can't happen in design #1. On the
other hand, there's always the possibility that the leader's session
is forcibly killed, even perhaps by pulling the plug. If the
background process contemplated by design #2 is well-designed, it can
recover and finish sending COMMIT PREPARED to each relevant server
after the next restart. In design #1, that background process doesn't
necessarily exist, so inevitably there is a possibility of orphaning
prepared transactions on the remote servers, which is not good. Even
if the DBA notices them, it won't be easy to figure out whether to
commit them or roll them back.

I think this thought experiment shows that, on the one hand, there is
a point to waiting for commits on the foreign servers, because it can
avoid the anomaly of not seeing the effects of your own commits. On
the other hand, it's ridiculous to suppose that every case can be
handled by waiting, because that just isn't true. You can't be sure
that you'll be able to wait long enough for COMMIT PREPARED to
complete, and even if that works out, you may not want to wait
indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has
no value whatsoever unless the system design is such that failing to
wait for it results in the ROLLBACK PREPARED never getting performed
-- which is a pretty poor excuse.

Moreover, there are good reasons to think that doing this kind of
cleanup work in the post-commit hooks is never going to be acceptable.
Generally, the post-commit hooks need to be no-fail, because it's too
late to throw an ERROR. But there's very little hope that a
connection to a remote server can be no-fail; anything that involves a
network connection is, by definition, prone to failure. We can try to
guarantee that every single bit of code that runs in the path that
sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
ERROR, but that's going to be quite difficult to do: even palloc() can
throw an error. And what about interrupts? We don't want to be stuck
inside this code for a long time without any hope of the user
recovering control of the session by pressing ^C, but of course the
way that works is it throws an ERROR, which we can't handle here. We
fixed a similar issue for synchronous replication in
9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
WARNING in that case and then return control to the user. But if we
do that here, all of the code that every FDW emits has to be aware of
that rule and follow it, and it just adds to the list of ways that the
user backend can escape this code without having cleaned up all of the
prepared transactions on the remote side.

Hmm, IIRC, my patch, and possibly the patch by Masahiko-san and
Vinayak, tries to resolve prepared transactions in post-commit code.
I agree with you here that it should be avoided and the backend should
take over the job of resolving transactions.

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation. The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.
It attempts to complete those and retries periodically. When a new
transaction needs this type of coordination, it adds the necessary
crash-proof state and then signals the background worker. If
appropriate, it can wait for the background worker to complete, just
like a CHECKPOINT waits for the checkpointer to finish -- but if the
CHECKPOINT command is interrupted, the actual checkpoint is
unaffected.

My patch, and hence the patch by Masahiko-san and Vinayak, has the
background worker in the equation. The background worker tries to
resolve prepared transactions on the foreign server periodically.
IIRC, sending it a signal when another backend creates foreign
prepared transactions is not implemented. That may be a good addition.
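
To make the retry behaviour concrete, here is a minimal sketch of one
pass of such a resolver, written as a standalone simulation rather than
as code from any of the patches. The struct, the callback type and the
function names are all hypothetical; a real worker would read its
pending entries from crash-safe state and sleep on a latch between
passes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of one unresolved foreign prepared transaction. */
typedef struct PendingFdwXact
{
    int  server_id;   /* illustrative foreign server identifier */
    bool commit;      /* true: COMMIT PREPARED, false: ROLLBACK PREPARED */
    bool resolved;    /* set once the remote operation has succeeded */
} PendingFdwXact;

/* Hypothetical callback: returns true if the remote command succeeded. */
typedef bool (*resolve_fn) (const PendingFdwXact *xact);

/*
 * One pass of the resolver: try every unresolved entry once and report
 * how many are still pending, so the caller can sleep and retry later.
 */
size_t
resolve_pending_once(PendingFdwXact *xacts, size_t n, resolve_fn try_resolve)
{
    size_t remaining = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (xacts[i].resolved)
            continue;
        if (try_resolve(&xacts[i]))
            xacts[i].resolved = true;
        else
            remaining++;    /* remote unavailable: keep it for the next pass */
    }
    return remaining;
}

/* Demo stub (assumption): server 1 is unreachable, all others respond. */
bool
stub_server_one_down(const PendingFdwXact *xact)
{
    return xact->server_id != 1;
}
```

A backend that signals the worker after creating new entries would only
shorten the delay before the next pass; the loop itself would not need
to change.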

More broadly, the question has been raised as to whether it's right to
try to handle atomic commit and atomic visibility as two separate
problems. The XTM API proposed by Postgres Pro aims to address both
with a single stroke. I don't think that API was well-designed, but
maybe the idea is good even if the code is not. Generally, there are
two ways in which you could imagine that a distributed version of
PostgreSQL might work. One possibility is that one node makes
everything work by going around and giving instructions to the other
nodes, which are more or less unaware that they are part of a cluster.
That is basically the design of Postgres-XC and certainly the design
being proposed here. The other possibility is that the nodes are
actually clustered in some way and agree on things like whether a
transaction committed or what snapshot is current using some kind of
consensus protocol. It is obviously possible to get a fairly long way
using the first approach but it seems likely that the second one is
fundamentally more powerful: among other things, because the first
approach is so centralized, the leader is apt to become a bottleneck.
And, quite apart from that, can a centralized architecture with the
leader manipulating the other workers ever allow for atomic
visibility? If atomic visibility can build on top of atomic commit,
then it makes sense to do atomic commit first, but if we build this
infrastructure and then find that we need an altogether different
solution for atomic visibility, that will be unfortunate.

There are two problems to solve as far as visibility is concerned. 1.
Consistency: which transactions' changes are visible to a given
transaction. 2. Making the changes by all the segments of a given
distributed transaction on different foreign servers visible at the
same time; IOW, no other transaction sees changes by only a few
segments without seeing the changes by all of them.

The first problem is hard to solve and there are many consistency
semantics. It is a large topic of discussion.

The second problem can be solved on top of this infrastructure by
extending the PREPARE TRANSACTION API. I am writing down my ideas so
that they don't get lost. It's not a completed design.

Assume that we have syntax which tells the foreign server which
originating server prepared the transaction: PREPARE TRANSACTION <GID>
FOR SERVER <local server name> WITH ID <xid>, where xid is the
transaction identifier on the local server. Or we may incorporate that
information in the GID itself, so that the foreign server knows how to
decode it.
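
As an illustration of the second alternative, the following sketch
packs the originating server name and xid into a GID string and decodes
it again. The "fx_<xid>_<server>" layout and both function names are
assumptions made up for this example; nothing in the thread fixes a
format. The only PostgreSQL fact relied on is that a GID must be
shorter than 200 bytes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define GID_MAX_LEN 200     /* a PostgreSQL GID must be < 200 bytes */

/* Build "fx_<xid>_<server>" (hypothetical layout); false if it won't fit. */
bool
encode_gid(char *gid, size_t gid_len, const char *server, unsigned int xid)
{
    int n = snprintf(gid, gid_len, "fx_%u_%s", xid, server);

    return n > 0 && (size_t) n < gid_len;
}

/* Recover the xid and server name; false if the GID has another layout. */
bool
decode_gid(const char *gid, char *server, size_t server_len,
           unsigned int *xid)
{
    int consumed = 0;

    /* %n records how many bytes the "fx_<xid>_" prefix occupied. */
    if (sscanf(gid, "fx_%u_%n", xid, &consumed) != 1 || consumed == 0)
        return false;
    if (gid[consumed] == '\0' || strlen(gid + consumed) >= server_len)
        return false;
    strcpy(server, gid + consumed);
    return true;
}
```

Putting the xid before the server name keeps decoding unambiguous even
when the server name itself contains underscores.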

Once we have that information, the foreign server can actively poll
the local server to get the status of transaction xid and resolve the
prepared transaction itself. It can go a step further and inform the
local server that it has resolved the transaction, so that the local
server can purge it from its own state. It can remember the fate of
xid, which can be consulted by another foreign server if the local
server is down. If another transaction on the foreign server stumbles
on a transaction prepared (but not resolved) by the local server, the
foreign server has two options: 1. consult the local server and
resolve; 2. if the first option fails to get the status of xid, or if
that option is not workable, throw an error, e.g. in-doubt
transaction. There is probably more network traffic happening here.
Usually, the local server should be able to resolve the transaction
before any other transaction stumbles upon it. The overhead is
incurred only when necessary.
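
The two options can be written down as a small decision function. This
is only a sketch of the idea, not a design: the enum and function names
are hypothetical, and treating a still-in-progress originating
transaction the same as an unreachable server is my assumption.

```c
/* What the foreign server learned about xid from the local server. */
typedef enum OriginStatus
{
    ORIGIN_COMMITTED,
    ORIGIN_ABORTED,
    ORIGIN_IN_PROGRESS,     /* local server has not decided yet */
    ORIGIN_UNREACHABLE      /* status could not be obtained */
} OriginStatus;

/* What the foreign server does with the prepared transaction. */
typedef enum InDoubtAction
{
    ACTION_COMMIT_PREPARED,
    ACTION_ROLLBACK_PREPARED,
    ACTION_RAISE_IN_DOUBT_ERROR
} InDoubtAction;

/*
 * Option 1: resolve according to the fate reported by the local server.
 * Option 2: if no definite fate is available, report an in-doubt error
 * to the transaction that stumbled on the prepared transaction.
 */
InDoubtAction
decide_in_doubt_action(OriginStatus status)
{
    switch (status)
    {
        case ORIGIN_COMMITTED:
            return ACTION_COMMIT_PREPARED;
        case ORIGIN_ABORTED:
            return ACTION_ROLLBACK_PREPARED;
        case ORIGIN_IN_PROGRESS:
        case ORIGIN_UNREACHABLE:
            break;
    }
    return ACTION_RAISE_IN_DOUBT_ERROR;
}
```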

--
Best Wishes
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#93

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#92)

Re: Transactions involving multiple postgres foreign servers

On Fri, Oct 21, 2016 at 2:38 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

However, when I briefly read the description in "Transaction Management in
the R* Distributed Database Management System (C. Mohan et al)" [2], it
seems that what Ashutosh is saying might be a correct way to proceed after
all:

I think Ashutosh is mostly right, but I think there's a lot of room to
doubt whether the design of this patch is good enough that we should
adopt it.

Consider two possible designs. In design #1, the leader performs the
commit locally and then tries to send COMMIT PREPARED to every standby
server afterward, and only then acknowledges the commit to the client.
In design #2, the leader performs the commit locally and then
acknowledges the commit to the client at once, leaving the task of
running COMMIT PREPARED to some background process. Design #2
involves a race condition, because it's possible that the background
process might not complete COMMIT PREPARED on every node before the
user submits the next query, and that query might then fail to see
supposedly-committed changes. This can't happen in design #1. On the
other hand, there's always the possibility that the leader's session
is forcibly killed, even perhaps by pulling the plug. If the
background process contemplated by design #2 is well-designed, it can
recover and finish sending COMMIT PREPARED to each relevant server
after the next restart. In design #1, that background process doesn't
necessarily exist, so inevitably there is a possibility of orphaning
prepared transactions on the remote servers, which is not good. Even
if the DBA notices them, it won't be easy to figure out whether to
commit them or roll them back.

I think this thought experiment shows that, on the one hand, there is
a point to waiting for commits on the foreign servers, because it can
avoid the anomaly of not seeing the effects of your own commits. On
the other hand, it's ridiculous to suppose that every case can be
handled by waiting, because that just isn't true. You can't be sure
that you'll be able to wait long enough for COMMIT PREPARED to
complete, and even if that works out, you may not want to wait
indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has
no value whatsoever unless the system design is such that failing to
wait for it results in the ROLLBACK PREPARED never getting performed
-- which is a pretty poor excuse.

Moreover, there are good reasons to think that doing this kind of
cleanup work in the post-commit hooks is never going to be acceptable.
Generally, the post-commit hooks need to be no-fail, because it's too
late to throw an ERROR. But there's very little hope that a
connection to a remote server can be no-fail; anything that involves a
network connection is, by definition, prone to failure. We can try to
guarantee that every single bit of code that runs in the path that
sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an
ERROR, but that's going to be quite difficult to do: even palloc() can
throw an error. And what about interrupts? We don't want to be stuck
inside this code for a long time without any hope of the user
recovering control of the session by pressing ^C, but of course the
way that works is it throws an ERROR, which we can't handle here. We
fixed a similar issue for synchronous replication in
9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a
WARNING in that case and then return control to the user. But if we
do that here, all of the code that every FDW emits has to be aware of
that rule and follow it, and it just adds to the list of ways that the
user backend can escape this code without having cleaned up all of the
prepared transactions on the remote side.

Hmm, IIRC, my patch, and possibly the patch by Masahiko-san and
Vinayak, tries to resolve prepared transactions in post-commit code.
I agree with you here that it should be avoided and the backend should
take over the job of resolving transactions.

It seems to me that the only way to really make this feature robust is
to have a background worker as part of the equation. The background
worker launches at startup and looks around for local state that tells
it whether there are any COMMIT PREPARED or ROLLBACK PREPARED
operations pending that weren't completed during the last server
lifetime, whether because of a local crash or remote unavailability.
It attempts to complete those and retries periodically. When a new
transaction needs this type of coordination, it adds the necessary
crash-proof state and then signals the background worker. If
appropriate, it can wait for the background worker to complete, just
like a CHECKPOINT waits for the checkpointer to finish -- but if the
CHECKPOINT command is interrupted, the actual checkpoint is
unaffected.

My patch, and hence the patch by Masahiko-san and Vinayak, has the
background worker in the equation. The background worker tries to
resolve prepared transactions on the foreign server periodically.
IIRC, sending it a signal when another backend creates foreign
prepared transactions is not implemented. That may be a good addition.

More broadly, the question has been raised as to whether it's right to
try to handle atomic commit and atomic visibility as two separate
problems. The XTM API proposed by Postgres Pro aims to address both
with a single stroke. I don't think that API was well-designed, but
maybe the idea is good even if the code is not. Generally, there are
two ways in which you could imagine that a distributed version of
PostgreSQL might work. One possibility is that one node makes
everything work by going around and giving instructions to the other
nodes, which are more or less unaware that they are part of a cluster.
That is basically the design of Postgres-XC and certainly the design
being proposed here. The other possibility is that the nodes are
actually clustered in some way and agree on things like whether a
transaction committed or what snapshot is current using some kind of
consensus protocol. It is obviously possible to get a fairly long way
using the first approach but it seems likely that the second one is
fundamentally more powerful: among other things, because the first
approach is so centralized, the leader is apt to become a bottleneck.
And, quite apart from that, can a centralized architecture with the
leader manipulating the other workers ever allow for atomic
visibility? If atomic visibility can build on top of atomic commit,
then it makes sense to do atomic commit first, but if we build this
infrastructure and then find that we need an altogether different
solution for atomic visibility, that will be unfortunate.

There are two problems to solve as far as visibility is concerned. 1.
Consistency: which transactions' changes are visible to a given
transaction. 2. Making the changes by all the segments of a given
distributed transaction on different foreign servers visible at the
same time; IOW, no other transaction sees changes by only a few
segments without seeing the changes by all of them.

The first problem is hard to solve and there are many consistency
semantics. It is a large topic of discussion.

The second problem can be solved on top of this infrastructure by
extending the PREPARE TRANSACTION API. I am writing down my ideas so
that they don't get lost. It's not a completed design.

Assume that we have syntax which tells the foreign server which
originating server prepared the transaction: PREPARE TRANSACTION <GID>
FOR SERVER <local server name> WITH ID <xid>, where xid is the
transaction identifier on the local server. Or we may incorporate that
information in the GID itself, so that the foreign server knows how to
decode it.

Once we have that information, the foreign server can actively poll
the local server to get the status of transaction xid and resolve the
prepared transaction itself. It can go a step further and inform the
local server that it has resolved the transaction, so that the local
server can purge it from its own state. It can remember the fate of
xid, which can be consulted by another foreign server if the local
server is down. If another transaction on the foreign server stumbles
on a transaction prepared (but not resolved) by the local server, the
foreign server has two options: 1. consult the local server and
resolve; 2. if the first option fails to get the status of xid, or if
that option is not workable, throw an error, e.g. in-doubt
transaction. There is probably more network traffic happening here.
Usually, the local server should be able to resolve the transaction
before any other transaction stumbles upon it. The overhead is
incurred only when necessary.

I think we can consider the atomic commit and the atomic visibility
separately, and the atomic visibility can build on the top of the
atomic commit. We can't provide atomic visibility across multiple
nodes without consistent updates, so I'd like to focus on atomic commit
in this thread. As for providing atomic commit, the two-phase commit
protocol is the perfect solution. Whatever type of solution we have
for atomic visibility, atomic commit by 2PC is a necessary feature. We
can consider an atomic commit feature that has the following
functionality:
* The local node is responsible for transaction management among the
relevant remote servers using 2PC.
* The local node has information about the state of the distributed
transaction.
* There is a process that resolves in-doubt transactions.
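
To illustrate the first two points, here is a toy simulation of the
local node driving 2PC over its participants. It is a sketch under
simplifying assumptions, not code from the patch: the types and names
are hypothetical, the can_prepare flag merely simulates a remote
failure, and the second phase is assumed to always succeed, which is
exactly the part the resolver process has to cover in practice.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum ParticipantState
{
    PART_ACTIVE,
    PART_PREPARED,
    PART_COMMITTED,
    PART_ROLLED_BACK
} ParticipantState;

/* Hypothetical stand-in for one remote server in the transaction. */
typedef struct Participant
{
    bool             can_prepare;   /* simulation knob: PREPARE succeeds? */
    ParticipantState state;
} Participant;

/* Returns true if the distributed transaction committed everywhere. */
bool
two_phase_commit(Participant *parts, size_t n)
{
    bool all_prepared = true;

    /* Phase 1: PREPARE TRANSACTION on every participant; stop on failure. */
    for (size_t i = 0; i < n; i++)
    {
        if (!parts[i].can_prepare)
        {
            all_prepared = false;
            break;
        }
        parts[i].state = PART_PREPARED;
    }

    /*
     * Phase 2: COMMIT PREPARED everywhere if phase 1 succeeded; otherwise
     * roll back (ROLLBACK PREPARED for prepared participants, a plain
     * ROLLBACK for those that never got prepared).
     */
    for (size_t i = 0; i < n; i++)
        parts[i].state = all_prepared ? PART_COMMITTED : PART_ROLLED_BACK;

    return all_prepared;
}
```

In a real implementation the local commit record is written between the
two phases, and any participant whose second-phase command fails stays
prepared until the resolving process retries it.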

As Ashutosh mentioned, the current patch supports almost all of these
functionalities. But I'm trying to update it so that it can store
information about multiple foreign servers in one FDWXact file and one
entry in the shared buffer. Otherwise, even though a new remote server
can be added on the fly, we might need to restart the local server in
order to allocate a larger shared buffer for FDW transactions whenever
a remote server is added. I'm also incorporating the other comments.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#94

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#92)

Re: Transactions involving multiple postgres foreign servers

On Fri, Oct 21, 2016 at 1:38 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Once we have that information, the foreign server can actively poll
the local server to get the status of transaction xid and resolve the
prepared transaction itself. It can go a step further and inform the
local server that it has resolved the transaction, so that the local
server can purge it from its own state. It can remember the fate of
xid, which can be consulted by another foreign server if the local
server is down. If another transaction on the foreign server stumbles
on a transaction prepared (but not resolved) by the local server, the
foreign server has two options: 1. consult the local server and
resolve; 2. if the first option fails to get the status of xid, or if
that option is not workable, throw an error, e.g. in-doubt
transaction. There is probably more network traffic happening here.
Usually, the local server should be able to resolve the transaction
before any other transaction stumbles upon it. The overhead is
incurred only when necessary.

Yes, something like this could be done. It's pretty complicated, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#95

Robert Haas

robertmhaas@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#93)

Re: Transactions involving multiple postgres foreign servers

On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we can consider the atomic commit and the atomic visibility
separately, and the atomic visibility can build on the top of the
atomic commit.

It is true that we can do that, but I'm not sure whether it's the best design.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#96

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Robert Haas (#95)

3 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Fri, Oct 28, 2016 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we can consider the atomic commit and the atomic visibility
separately, and the atomic visibility can build on the top of the
atomic commit.

It is true that we can do that, but I'm not sure whether it's the best design.

I'm not sure it's the best design either. We need to discuss it more.
But this is not a feature specific to the sharding solution. Atomic
commit using 2PC is useful for any foreign server that can use 2PC,
not only postgres_fdw.

Attached are the latest 3 patches, which incorporate the review
comments so far. The recovery speed improvement that is being discussed
on another thread is not incorporated yet.
Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_support_fdw_xact_v2.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index adab2f8..f1c7d69 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1419,6 +1419,48 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        If this parameter is set to zero (which is the default) and
+        <xref linkend="guc-atomic-foreign-transaction"> is enabled,
+        transactions involving foreign servers will not succeed, because foreign
+        transactions can not be prepared.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-atomic-foreign-transaction" xreflabel="atomic_foreign_transaction">
+      <term><varname>atomic_foreign_transaction</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>atomic_foreign_transaction</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+       When this parameter is enabled the transaction involving foreign server/s is
+       guaranteed to commit all or none of the changes to the foreign server/s.
+       The parameter can be set any time during the session. The value of this parameter
+       at the time of committing the transaction is used.
+       </para>
+
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0c1db07..c86f650 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1700,5 +1700,87 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to the foreign server while starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</> using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such transaction is as follows
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be
+    using different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is enabled
+    <productname>PostgreSQL</> employs Two-phase commit protocol to achieve
+    atomic distributed transaction. All the foreign servers registered should
+    support two-phase commit protocol. In Two-phase commit protocol the commit
+    is processed in two phases: prepare phase and commit phase. In prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>.
+    If any of the foreign servers fails to prepare the transaction, prepare phase fails.
+    In commit phase, all the prepared transactions are committed if prepare
+    phase has succeeded or rolled back if prepare phase fails to prepare
+    transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During prepare phase the distributed transaction manager calls
+    <function>GetPrepareInfo</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>HandleForeignTransaction</> with the same identifier with action
+    FDW_XACT_PREPARE.
+    </para>
+    
+    <para>
+    During commit phase the distributed transaction manager calls
+    <function>HandleForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMIT_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORT_PREPARED to rollback the prepared transaction. In case the
+    distributed transaction manager fails to commit or rollback a prepared
+    transaction because of connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>atomic_foreign_transaction</> is disabled, atomicity can not be
+    guaranteed across foreign servers. If transaction on <productname>PostgreSQL</>
+    is committed, Distributed transaction manager calls
+    <function>HandleForeignTransaction</> to commit the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while the same
+    on other foreign servers would be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..b01ccf8
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 62ed1dc..c2f36c7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..08de460
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2115 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on the other foreign servers
+ * are committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, the foreign servers are requested to commit
+ * their prepared transactions. If the first phase does not succeed because of
+ * any failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions that are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to the
+ * disk and logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		prepare_id_provider;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+				fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier provider function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_id_provider = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Restore the previous memory context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;   /* XLOG offset where this entry's record starts */
+	XLogRecPtr		fdw_xact_end_lsn;   /* XLOG offset where this entry's record ends */
+
+	bool			fdw_xact_valid;		/* Is the entry complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool            ondisk;             /* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is the xid, foreign server oid and
+ * user oid, each printed as 8 hexadecimal digits and separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * GUC variable; a change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ * The function is responsible for pre-commit processing on foreign connections.
+ * The foreign transactions are prepared on the foreign servers which can
+ * execute two-phase-commit protocol. Those will be aborted or committed after
+ * the current transaction has been aborted or committed, respectively. We try
+ * to commit transactions on the rest of the foreign servers now. For these
+ * foreign servers it is possible that some transactions commit even if the
+ * local transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Here, the foreign servers that can not execute two-phase-commit protocol
+	 * have already committed and MyFDWConnections has only the foreign servers
+	 * that can execute two-phase-commit protocol. We don't need to use
+	 * two-phase-commit protocol when there is only one foreign server that
+	 * can execute two-phase-commit.
+	 */
+	if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Drop the only remaining entry, leaving MyFDWConnections empty */
+		MyFDWConnections = list_delete_first(MyFDWConnections);
+	}
+	else
+	{
+		/*
+		 * Prepare the transactions on the foreign servers, which can execute
+		 * two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+}
+
+/*
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_info;
+		int				fdw_xact_info_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->prepare_id_provider);
+		fdw_xact_info = fdw_conn->prepare_id_provider(fdw_conn->serverid,
+													  fdw_conn->userid,
+													  &fdw_xact_info_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (that way relaying it on standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_info_len,
+									 fdw_xact_info);
+		/*
+		 * Between the register_fdw_xact call above and the time this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it
+		 * will send an ABORT message to the foreign server. If the foreign
+		 * server has not prepared the transaction, the message will succeed. If
+		 * the foreign server has prepared the transaction, it will throw an
+		 * error, which we will ignore, and the prepared foreign transaction
+		 * will be resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_info_len,
+											fdw_xact_info))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 * TODO:
+			 * FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL and then persists it to the disk by creating a file under
+ * data/pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	/*
+	 * The state file isn't valid yet, because we haven't written the correct
+	 * CRC yet.	 Before we do that, insert entry in WAL and flush it to disk.
+	 *
+	 * Between the time we have written the WAL entry and the time we write
+	 * out the correct state file CRC, we have an inconsistency: we have
+	 * recorded the foreign transaction in WAL but not on the disk. We
+	 * use a critical section to force a PANIC if we are unable to complete
+	 * the write --- then, WAL replay should repair the inconsistency.	The
+	 * odds of a PANIC actually occurring should be very tiny given that we
+	 * were able to write the bogus CRC above.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We have to set delayChkpt here, too; otherwise a checkpoint starting
+	 * immediately after the WAL record is inserted could complete without
+	 * fsync'ing our foreign transaction file. (This is essentially the same
+	 * kind of race condition as the COMMIT-to-clog-write case that
+	 * RecordTransactionCommit uses delayChkpt for; see notes there.)
+	 */
+	MyPgXact->delayChkpt = true;
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* If we crash now, we have prepared: WAL replay will fix things */
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	MyPgXact->delayChkpt = false;
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			/* Remove the file from the disk as well. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between, we will be left with an entry
+	 * locked by this backend, which gets unlocked only at the server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+				fdw->fdwname);
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED) ?
+							true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact();
+ * see that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by the
+ * arguments that carry valid values for their respective datatypes.
+ *
+ * The table below lists the possible combinations:
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every existing entry
+ * matches, so the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are so locked, the list
+ * will be empty but the return value will still be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * Returns the OIDs of the databases containing unresolved foreign transactions.
+ * The function is used by the pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	int             rec_len = XLogRecGetDataLen(record);
+	TransactionId xid = XLogRecGetXid(record);
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData       *fdw_xact_data_file = (FDWXactOnDiskData *)rec;
+		char                            path[MAXPGPATH];
+		int                                     fd;
+		pg_crc32c                       fdw_xact_crc;
+
+		/* Recompute CRC */
+		INIT_CRC32C(fdw_xact_crc);
+		COMP_CRC32C(fdw_xact_crc, rec, rec_len);
+		FIN_CRC32C(fdw_xact_crc);
+
+		FDWXactFilePath(path, xid, fdw_xact_data_file->serverid,
+						fdw_xact_data_file->userid);
+
+		/*
+		 * The file may already exist if it was flushed to disk after being
+		 * created. It might have been flushed while it was still being crafted,
+		 * so its contents cannot be guaranteed to be accurate. Hence truncate
+		 * and rewrite the file.
+		 */
+		fd = OpenTransientFile(path, O_CREAT | O_WRONLY | O_TRUNC | PG_BINARY,
+							   S_IRUSR | S_IWUSR);
+		if (fd < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not create/open foreign transaction state file \"%s\": %m",
+							path)));
+
+		/* The log record is exactly the contents of the file. */
+		if (write(fd, rec, rec_len) != rec_len)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+
+		if (write(fd, &fdw_xact_crc, sizeof(pg_crc32c)) != sizeof(pg_crc32c))
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write foreign transaction state file \"%s\": %m", path)));
+		}
+
+		/*
+		 * We must fsync the file because the end-of-replay checkpoint will not do
+		 * so, there being no foreign transaction entry in shared memory yet to
+		 * tell it to.
+		 */
+		if (pg_fsync(fd) != 0)
+		{
+			CloseTransientFile(fd);
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync foreign transaction state file: %m")));
+		}
+
+		CloseTransientFile(fd);
+	}
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+
+		RemoveFDWXactFile(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid, true);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints.
+ * The foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might have
+ * fewer files to sync, thus saving some I/O. This is similar to
+ * CheckPointTwoPhase().
+ * For simplicity, the function performs the file I/O while holding
+ * FDWXactLock, so no new foreign transaction can be prepared in the meantime.
+ * Moving the I/O out of the lock would create a race condition: after
+ * releasing the lock and before writing a file, the corresponding foreign
+ * transaction entry, and hence the file, might get removed. See the comment
+ * inside the function for details.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, if there are no foreign transactions */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXacts that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	Assert(!RecoveryInProgress());
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction file: %m")));
+	}
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on a foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * Returns the number of foreign prepared transactions and, through fdw_xacts,
+ * an array of them for the user-level function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* Should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * pg_fdw_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is not
+				 * running on any of the backends. So it probably crashed before
+				 * the actual abort or commit. Assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up.
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* Should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in cases where
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering to a point before the transaction id which created
+ *	  the prepared foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * or any other condition in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("incorrect size of FDW transaction state file \"%s\"",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+	/*
+	 * Check the CRC.
+	 */
+
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		/* The file was already closed above; do not close it again. */
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* Add some valid LSNs */
+			fdw_xact->fdw_xact_start_lsn = 0;
+			fdw_xact->fdw_xact_end_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..ad71c0e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604..734ed48 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e11b229..4956b3d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -186,6 +187,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction;
+										   only valid for a top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1917,6 +1922,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1977,6 +1985,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2134,6 +2145,7 @@ CommitTransaction(void)
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2318,6 +2330,7 @@ PrepareTransaction(void)
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
@@ -2604,6 +2617,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6b1f24e..9e6aa75 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4909,6 +4910,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -5976,6 +5978,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6662,7 +6667,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7278,6 +7286,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7463,6 +7472,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8738,6 +8750,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9172,7 +9189,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9193,6 +9211,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9208,6 +9227,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9396,6 +9416,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9588,6 +9609,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 3870a4d..fca709d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ada2142..7eaaa6d 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -254,6 +254,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index eb531af..9a10696 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1087,6 +1088,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1385,6 +1400,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..c0f000c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index c04b17f..74f10b7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -141,6 +142,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, BTreeShmemSize());
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -253,6 +255,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	BTreeShmemInit();
 	SyncScanShmemInit();
 	AsyncShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f8996cd..6589cfe 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -47,3 +47,4 @@ CommitTsLock						39
 ReplicationOriginLock				40
 MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
+FDWXactLock					43
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65660c1..1747065 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -2061,6 +2062,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 7c2daa5..db10e83 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 2f92dfa..fc8cd53 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index c8a8c52..e69e6d0 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -204,6 +204,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3370966 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -301,5 +301,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:   %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 2b76f64..dda2d7a 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..d6ff550 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..42f1838
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..86448ff 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,6 +44,7 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..2990e05 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 0bc41ab..3413201 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 17ec71d..21d87e1 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5258,6 +5258,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+DATA(insert OID = 4111 ( pg_fdw_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..a11f5b6 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 7dc8dac..888a2b0 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -251,11 +251,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 90f5132..334663f 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1331,4 +1331,8 @@ extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index f2dedbb..8c65562 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2273,7 +2275,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

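For readers following PrescanFDWXacts() in the patch above, the following is a minimal standalone sketch of the two ideas it relies on: parsing the xid_serverid_userid state file name, and comparing XIDs in a wraparound-aware way to find the oldest one. It is not code from the patch. The helper names (xid_precedes, parse_fdw_xact_name, prescan) are hypothetical, the comparison is a simplified stand-in for TransactionIdPrecedes() that ignores the special permanent XIDs, and the name length of 26 is inferred from the "%08x_%08x_%08x" format rather than taken from the patch's FDW_XACT_FILE_NAME_LEN definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed from the "%08X_%08X_%08X" layout: 3 * 8 hex digits + 2 underscores. */
#define NAME_LEN 26

/* Wraparound-aware "a is older than b"; simplified, ignores special XIDs. */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

/* Returns true and fills the outputs if name looks like an FDW xact file. */
static bool
parse_fdw_xact_name(const char *name, uint32_t *xid,
					uint32_t *serverid, uint32_t *userid)
{
	unsigned int x, s, u;

	if (strlen(name) != NAME_LEN ||
		strspn(name, "0123456789ABCDEF_") != NAME_LEN)
		return false;
	/* Unlike the patch, also check that all three fields were converted. */
	if (sscanf(name, "%8x_%8x_%8x", &x, &s, &u) != 3)
		return false;
	*xid = x;
	*serverid = s;
	*userid = u;
	return true;
}

/*
 * Return the oldest xid among the given file names and oldest_active.
 * Names whose xid is at or beyond next_xid are skipped; the patch removes
 * such "future" files instead.
 */
static uint32_t
prescan(const char **names, int n, uint32_t oldest_active, uint32_t next_xid)
{
	int			i;

	assert(names != NULL);
	for (i = 0; i < n; i++)
	{
		uint32_t	xid, serverid, userid;

		if (!parse_fdw_xact_name(names[i], &xid, &serverid, &userid))
			continue;			/* not a foreign transaction state file */
		if (!xid_precedes(xid, next_xid))
			continue;			/* future xid: the patch unlinks the file here */
		if (xid_precedes(xid, oldest_active))
			oldest_active = xid;
	}
	return oldest_active;
}
```

Checking the sscanf() return value is an addition in this sketch. The strspn() test alone would accept a 26-character name made only of underscores, in which case the parsed values would be left uninitialized.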
Attachment: 002_pg_fdw_resolver_v2.patch (application/octet-stream)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..100f8fe
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,365 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How frequently (in ms) the resolver daemon checks for unresolved transactions */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notifications to it are not enabled. So set that flag even if the
+	 * backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL;
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need a handler
+	 * only for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
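
The launcher's scheduling policy described in the header comment above (one per-database worker at a time, the next one started only after the previous one is reported stopped) can be modelled in a few lines of standalone C. This is a sketch under assumptions, not the module's code: `DemoLauncher`, `demo_launcher_tick()` and `demo_worker_stopped()` are hypothetical names, `worker_running` stands in for the `handle != NULL` check, and unlike the patch the sketch keeps the list of pending database OIDs across iterations instead of calling get_dbids_with_unresolved_xact().

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_DBS 8

/* Hypothetical model of the launcher's state between loop iterations. */
typedef struct
{
	unsigned int pending[DEMO_MAX_DBS];	/* database OIDs still to visit */
	size_t		npending;
	bool		worker_running;	/* models "handle != NULL" */
	unsigned int current_dbid;	/* database served by the running worker */
} DemoLauncher;

/*
 * One pass of the launcher loop: start a worker for the first pending
 * database, but only if no worker is currently running.  Returns true if a
 * worker was started.
 */
static bool
demo_launcher_tick(DemoLauncher *l)
{
	size_t		i;

	if (l->worker_running || l->npending == 0)
		return false;

	/* Take the first dbid and remove it from the list. */
	l->current_dbid = l->pending[0];
	for (i = 1; i < l->npending; i++)
		l->pending[i - 1] = l->pending[i];
	l->npending--;

	l->worker_running = true;
	return true;
}

/* Models the launcher learning that its worker has stopped. */
static void
demo_worker_stopped(DemoLauncher *l)
{
	l->worker_running = false;
}
```

In the real module the "stopped" notification arrives as SIGUSR1 followed by GetBackgroundWorkerPid() returning BGWH_STOPPED; the sketch collapses that into a single function call.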

001_pg_fdw_supports_2pc_v2.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..63482dc 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok if true, indicates that caller can handle connection
+ * error by itself. If false, raise error.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -235,11 +262,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"", server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
@@ -370,15 +400,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether the two phase
+		 * compliance is possible.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -586,158 +623,269 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The reported length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+
+		PGconn	*conn = entry->conn;
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Either way, consider it resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -836,3 +984,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 224aed9..6a20c47 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -176,6 +177,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..c79eacf 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -465,6 +467,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1321,7 +1329,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1698,7 +1706,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2293,7 +2301,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2555,7 +2563,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3492,7 +3500,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3582,7 +3590,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3805,7 +3813,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..8409671 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);

#97

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Masahiko Sawada (#96)

Re: Transactions involving multiple postgres foreign servers

On Mon, Oct 31, 2016 at 6:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Oct 28, 2016 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we can consider the atomic commit and the atomic visibility
separately, and the atomic visibility can build on the top of the
atomic commit.

It is true that we can do that, but I'm not sure whether it's the best design.

I'm not sure it's the best design, either. We need to discuss it more. But this
is not a feature specific to the sharding solution. Atomic commit
using 2PC is useful for other servers that can use 2PC, not only
postgres_fdw.

I think we need to discuss the big picture, i.e. the architecture of a
distributed transaction manager for PostgreSQL, divide it into smaller
problems, and then solve each of them as a series of commits, possibly
producing a useful feature with each commit. I think what Robert is
pointing out is that if we spend time solving smaller problems first, we
might end up with something that cannot be used to solve the bigger
problem. If we instead define the bigger problem and come up with
clear subproblems that, when solved, would solve the bigger problem, we
can avoid ending up in such a situation.

There are many distributed transaction models discussed in various
papers like [1], [2], [3]. We need to assess which of them would suit
the PostgreSQL FDW infrastructure, and maybe postgres_fdw
specifically. There is some discussion at [4]. It lists a few
approaches, but I could not find a discussion of the pros and cons of each
of them, or a conclusion as to which of the approaches suits
PostgreSQL. Maybe we want to start that discussion.

I know that it's hard to come up with a single model that would suit
FDWs or would serve all kinds of applications. We may not be able to
support a full distributed transaction manager for every FDW out
there. It's possible that because of lack of the big picture, we will
not see anything happen in this area for another release. Given that
and since all of the models in those papers require 2PC as a basic
building block, I was of the opinion that we could at least start with
2PC implementation. But I think request for bigger picture is also
valid for reasons stated above.

Attached latest 3 patches that incorporated review comments so far.
But recovery speed improvement that is discussed on another thread is
not incorporated yet.
Please give me feedback.

[1]: http://link.springer.com/article/10.1007/s00778-014-0359-9
[2]: https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/1c0a12a383dd2cd8c125613300585c64/7684dd8109a5b3d5c1256de40051686f/$FILE/tdd99.pdf
[3]: http://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1713&context=cstech
[4]: https://wiki.postgresql.org/wiki/DTM

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#98

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#97)

Re: Transactions involving multiple postgres foreign servers

On Wed, Nov 2, 2016 at 9:22 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Oct 31, 2016 at 6:17 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Oct 28, 2016 at 3:19 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 26, 2016 at 2:00 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think we can consider the atomic commit and the atomic visibility
separately, and the atomic visibility can build on the top of the
atomic commit.

It is true that we can do that, but I'm not sure whether it's the best design.

I'm not sure it's the best design, either. We need to discuss it more. But this
is not a feature specific to the sharding solution. Atomic commit
using 2PC is useful for other servers that can use 2PC, not only
postgres_fdw.

I think, we need to discuss the big picture i.e. architecture for
distributed transaction manager for PostgreSQL. Divide it in smaller
problems and then solve each of them as series of commits possibly
producing a useful feature with each commit. I think, what Robert is
pointing out is if we spend time solving smaller problems, we might
end up with something which can not be used to solve the bigger
problem. Instead, if we define the bigger problem and come up with
clear subproblems that when solved would solve the bigger problem, we
may not end up in such a situation.

There are many distributed transaction models discussed in various
papers like [1], [2], [3]. We need to assess which one/s, would suit
PostgreSQL FDW infrastructure and may be specifically for
postgres_fdw. There is some discussion at [4]. It lists a few
approaches, but I could not find a discussion on pros and cons of each
of them, and a conclusion as to which of the approaches suits
PostgreSQL. May be we want to start that discussion.

Agreed. Let's start the discussion.
I think that it's important to choose what type of transaction
coordination we employ: centralized or distributed.

I know that it's hard to come up with a single model that would suit
FDWs or would serve all kinds of applications. We may not be able to
support a full distributed transaction manager for every FDW out
there. It's possible that because of lack of the big picture, we will
not see anything happen in this area for another release. Given that
and since all of the models in those papers require 2PC as a basic
building block, I was of the opinion that we could at least start with
2PC implementation. But I think request for bigger picture is also
valid for reasons stated above.

2PC is a basic building block for atomic commit, and there
are some optimizations that reduce its disadvantages. As
you mentioned, it's hard to support a single model that would suit
several types of FDWs. But even apart from sharding,
many other databases that can be connected to PostgreSQL via
FDW support 2PC, so 2PC for FDW would be useful beyond sharding.
That's why I have been focusing on implementing 2PC for FDW so far.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#99

Haribabu Kommi

kommi.haribabu@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#98)

Re: Transactions involving multiple postgres foreign servers

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block to support the atomic commit and there
are some optimizations way in order to reduce disadvantage of 2PC. As
you mentioned, it's hard to support a single model that would suit
several type of FDWs. But even if it's not a purpose for sharding,
because many other database which could be connected to PostgreSQL via
FDW supports 2PC, 2PC for FDW would be useful for not only sharding
purpose. That's why I was focusing on implementing 2PC for FDW so far.

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

#100

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Haribabu Kommi (#99)

Re: Transactions involving multiple postgres foreign servers

On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block to support the atomic commit and there
are some optimizations way in order to reduce disadvantage of 2PC. As
you mentioned, it's hard to support a single model that would suit
several type of FDWs. But even if it's not a purpose for sharding,
because many other database which could be connected to PostgreSQL via
FDW supports 2PC, 2PC for FDW would be useful for not only sharding
purpose. That's why I was focusing on implementing 2PC for FDW so far.

Moved to next CF with "needs review" status.

I think this should be changed to "returned with feedback". The
design and approach themselves need to be discussed. I think we should
let the authors decide whether they want it added to the next
commitfest or not.

When I first started this work, Tom suggested that I try to
make PREPARE and COMMIT/ROLLBACK PREPARED work for transactions involving
foreign servers, or at least postgres_fdw servers. I think most of my work,
which Vinayak and Sawada have rebased onto the latest master, will be required
to get what Tom suggested done. We wouldn't need a lot of changes
to that design. PREPARE involving foreign servers errors out right
now. If we start supporting prepared transactions involving foreign
servers, that will be a good improvement over the current status quo.
Once we get that done, we can continue working on the larger problem
of supporting ACID transactions involving foreign servers.
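
To make the PREPARE and COMMIT/ROLLBACK PREPARED flow mentioned above concrete, here is a minimal sketch of a coordinator driving it. It is an illustration only, not code from any patch in this thread. PREPARE TRANSACTION, COMMIT PREPARED and ROLLBACK PREPARED are the existing SQL commands; everything else (two_phase_commit, SendFn, FakeCluster, fake_send, the gid value) is hypothetical. A real coordinator would send the commands over libpq connections and would have to persist its state so that in-doubt transactions can be resolved after a crash, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result type, not taken from the patch. */
typedef enum { TXN_COMMITTED, TXN_ROLLED_BACK } TxnOutcome;

/* Hypothetical transport hook: sends one SQL command to a server, true on success. */
typedef bool (*SendFn)(int server, const char *sql, void *ctx);

/* Build e.g. "PREPARE TRANSACTION 'gid'"; assumes gid contains no single quotes. */
static void
format_2pc_command(char *buf, size_t len, const char *verb, const char *gid)
{
    int n = snprintf(buf, len, "%s '%s'", verb, gid);

    assert(n > 0 && (size_t) n < len);  /* the command must fit in the buffer */
    (void) n;
}

static TxnOutcome
two_phase_commit(int nservers, const char *gid, SendFn send_sql, void *ctx)
{
    char sql[256];
    int  nprepared = 0;
    bool all_prepared = true;

    /* Phase 1: prepare on every server, stopping at the first failure. */
    format_2pc_command(sql, sizeof sql, "PREPARE TRANSACTION", gid);
    for (int i = 0; i < nservers; i++)
    {
        if (!send_sql(i, sql, ctx))
        {
            all_prepared = false;
            break;
        }
        nprepared++;
    }

    /* Phase 2: resolve every server that holds a prepared transaction. */
    format_2pc_command(sql, sizeof sql,
                       all_prepared ? "COMMIT PREPARED" : "ROLLBACK PREPARED",
                       gid);
    for (int i = 0; i < nprepared; i++)
        (void) send_sql(i, sql, ctx);   /* failures here leave in-doubt transactions */

    /* Servers after the failed one still have an open, unprepared transaction. */
    if (!all_prepared)
        for (int i = nprepared + 1; i < nservers; i++)
            (void) send_sql(i, "ROLLBACK", ctx);

    return all_prepared ? TXN_COMMITTED : TXN_ROLLED_BACK;
}

/* Test double standing in for real connections: counts the commands it sees. */
typedef struct
{
    int fail_prepare_on;    /* server index whose PREPARE fails, or -1 */
    int prepared, committed, rolled_back_prepared, rolled_back_plain;
} FakeCluster;

static bool
fake_send(int server, const char *sql, void *ctx)
{
    FakeCluster *c = ctx;

    if (strncmp(sql, "PREPARE TRANSACTION", 19) == 0)
    {
        if (server == c->fail_prepare_on)
            return false;
        c->prepared++;
    }
    else if (strncmp(sql, "COMMIT PREPARED", 15) == 0)
        c->committed++;
    else if (strncmp(sql, "ROLLBACK PREPARED", 17) == 0)
        c->rolled_back_prepared++;
    else if (strcmp(sql, "ROLLBACK") == 0)
        c->rolled_back_plain++;
    return true;
}
```

Calling two_phase_commit(3, "fx_1", fake_send, &cluster) with no failing server commits on all three servers. If one PREPARE fails, the servers already prepared get ROLLBACK PREPARED and the ones not yet reached get a plain ROLLBACK. The sketch ignores failures during phase 2, which is exactly the window where a prepared transaction can be left behind and needs later resolution.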

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#101

Haribabu Kommi

kommi.haribabu@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#100)

Re: Transactions involving multiple postgres foreign servers

On Mon, Dec 5, 2016 at 4:42 PM, Ashutosh Bapat <
ashutosh.bapat@enterprisedb.com> wrote:

On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block to support the atomic commit and there
are some optimizations way in order to reduce disadvantage of 2PC. As
you mentioned, it's hard to support a single model that would suit
several type of FDWs. But even if it's not a purpose for sharding,
because many other database which could be connected to PostgreSQL via
FDW supports 2PC, 2PC for FDW would be useful for not only sharding
purpose. That's why I was focusing on implementing 2PC for FDW so far.

Moved to next CF with "needs review" status.

I think this should be changed to "returned with feedback.". The
design and approach itself needs to be discussed. I think, we should
let authors decide whether they want it to be added to the next
commitfest or not.

When I first started with this work, Tom had suggested me to try to
make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or
at least postgres_fdw servers work. I think, most of my work that
Vinayak and Sawada have rebased to the latest master will be required
for getting what Tom suggested done. We wouldn't need a lot of changes
to that design. PREPARE involving foreign servers errors out right
now. If we start supporting prepared transactions involving foreign
servers that will be a good improvement over the current status-quo.
Once we get that done, we can continue working on the larger problem
of supporting ACID transactions involving foreign servers.

Thanks for the update.
I closed it in commitfest 2017-01 with "returned with feedback". Author can
update it once the new patch is submitted.

Regards,
Hari Babu
Fujitsu Australia

#102

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

about 9 years ago

In reply to: Ashutosh Bapat (#100)

Re: Transactions involving multiple postgres foreign servers

On 2016/12/05 14:42, Ashutosh Bapat wrote:

On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block to support the atomic commit and there
are some optimizations way in order to reduce disadvantage of 2PC. As
you mentioned, it's hard to support a single model that would suit
several type of FDWs. But even if it's not a purpose for sharding,
because many other database which could be connected to PostgreSQL via
FDW supports 2PC, 2PC for FDW would be useful for not only sharding
purpose. That's why I was focusing on implementing 2PC for FDW so far.

Moved to next CF with "needs review" status.

I think this should be changed to "returned with feedback.". The
design and approach itself needs to be discussed. I think, we should
let authors decide whether they want it to be added to the next
commitfest or not.

When I first started with this work, Tom had suggested me to try to
make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or
at least postgres_fdw servers work. I think, most of my work that
Vinayak and Sawada have rebased to the latest master will be required
for getting what Tom suggested done. We wouldn't need a lot of changes
to that design. PREPARE involving foreign servers errors out right
now. If we start supporting prepared transactions involving foreign
servers that will be a good improvement over the current status-quo.
Once we get that done, we can continue working on the larger problem
of supporting ACID transactions involving foreign servers.

At the PgConf.Asia developers meeting, Bruce Momjian and other developers
discussed FDW-based sharding [1]. The suggestion from other hackers was that
we need to discuss the big picture and use cases of sharding. Bruce has listed
all the building blocks of built-in sharding on the wiki [2]. IIUC, a
transaction manager involving foreign servers is one part of sharding.
As per Bruce's wiki page, there are two use cases for transactions
involving multiple foreign servers:
1. Cross-node read-only queries on read/write shards:
This will require a global snapshot manager to make sure the shards
return consistent data.
2. Cross-node read-write queries:
This will require a global snapshot manager and global transaction
manager.

I agree with you that if we start supporting PREPARE and COMMIT/ROLLBACK
PREPARED involving foreign servers, that will be a good improvement.

[1]: https://wiki.postgresql.org/wiki/PgConf.Asia_2016_Developer_Meeting
[2]: https://wiki.postgresql.org/wiki/Built-in_Sharding

Regards,
Vinayak Pokale
NTT Open Source Software Center


#103

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: vinayak (#102)

Re: Transactions involving multiple postgres foreign servers

On Fri, Dec 9, 2016 at 3:02 PM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

On 2016/12/05 14:42, Ashutosh Bapat wrote:

On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block to support the atomic commit and there
are some optimizations way in order to reduce disadvantage of 2PC. As
you mentioned, it's hard to support a single model that would suit
several type of FDWs. But even if it's not a purpose for sharding,
because many other database which could be connected to PostgreSQL via
FDW supports 2PC, 2PC for FDW would be useful for not only sharding
purpose. That's why I was focusing on implementing 2PC for FDW so far.

Moved to next CF with "needs review" status.

I think this should be changed to "returned with feedback.". The
design and approach itself needs to be discussed. I think, we should
let authors decide whether they want it to be added to the next
commitfest or not.

When I first started with this work, Tom had suggested me to try to
make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or
at least postgres_fdw servers work. I think, most of my work that
Vinayak and Sawada have rebased to the latest master will be required
for getting what Tom suggested done. We wouldn't need a lot of changes
to that design. PREPARE involving foreign servers errors out right
now. If we start supporting prepared transactions involving foreign
servers that will be a good improvement over the current status-quo.
Once we get that done, we can continue working on the larger problem
of supporting ACID transactions involving foreign servers.

In the pgconf ASIA depelopers meeting Bruce Momjian and other developers
discussed
on FDW based sharding [1]. The suggestions from other hackers was that we
need to discuss
the big picture and use cases of sharding. Bruce has listed all the building
blocks of built-in sharding
on wiki [2]. IIUC,transaction manager involving foreign servers is one part
of sharding.

Yeah, 2PC on FDW is a basic building block for FDW-based sharding,
and it would be useful not only for FDW sharding but also for other purposes.
As far as I can tell from the papers I surveyed, many kinds of distributed
transaction management architectures use 2PC for atomic commit,
with some optimisations. And using 2PC to provide atomic commit for
distributed transactions has a strong affinity with the current PostgreSQL
implementation in some respects.

As per the Bruce's wiki page there are two use cases for transactions
involved multiple foreign servers:
1. Cross-node read-only queries on read/write shards:
This will require a global snapshot manager to make sure the shards
return consistent data.
2. Cross-node read-write queries:
This will require a global snapshot manager and global transaction
manager.

I agree with you that if we start supporting PREPARE and COMMIT/ROLLBACK
PREPARED
involving foreign servers that will be good improvement.

[1] https://wiki.postgresql.org/wiki/PgConf.Asia_2016_Developer_Meeting
[2] https://wiki.postgresql.org/wiki/Built-in_Sharding

I also agree that we should work on implementing atomic commit across
foreign servers first and then continue to work on the larger problem.
I think that this will be a large step forward. I'm going to submit the
updated version of the patch to CF3.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#104

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#103)

3 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Fri, Dec 9, 2016 at 4:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 9, 2016 at 3:02 PM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

On 2016/12/05 14:42, Ashutosh Bapat wrote:

On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block to support the atomic commit and there
are some optimizations way in order to reduce disadvantage of 2PC. As
you mentioned, it's hard to support a single model that would suit
several type of FDWs. But even if it's not a purpose for sharding,
because many other database which could be connected to PostgreSQL via
FDW supports 2PC, 2PC for FDW would be useful for not only sharding
purpose. That's why I was focusing on implementing 2PC for FDW so far.

Moved to next CF with "needs review" status.

I think this should be changed to "returned with feedback.". The
design and approach itself needs to be discussed. I think, we should
let authors decide whether they want it to be added to the next
commitfest or not.

When I first started with this work, Tom had suggested me to try to
make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or
at least postgres_fdw servers work. I think, most of my work that
Vinayak and Sawada have rebased to the latest master will be required
for getting what Tom suggested done. We wouldn't need a lot of changes
to that design. PREPARE involving foreign servers errors out right
now. If we start supporting prepared transactions involving foreign
servers that will be a good improvement over the current status-quo.
Once we get that done, we can continue working on the larger problem
of supporting ACID transactions involving foreign servers.

In the pgconf ASIA depelopers meeting Bruce Momjian and other developers
discussed
on FDW based sharding [1]. The suggestions from other hackers was that we
need to discuss
the big picture and use cases of sharding. Bruce has listed all the building
blocks of built-in sharding
on wiki [2]. IIUC,transaction manager involving foreign servers is one part
of sharding.

Yeah, the 2PC on FDW is a basic building block for FDW based sharding
and it would be useful not only FDW sharding but also other purposes.
As far as I surveyed some papers the many kinds of distributed
transaction management architectures use the 2PC for atomic commit
with some optimisations. And using 2PC to provide atomic commit on
distributed transaction has much affinity with current PostgreSQL
implementation from some perspective.

As per the Bruce's wiki page there are two use cases for transactions
involved multiple foreign servers:
1. Cross-node read-only queries on read/write shards:
This will require a global snapshot manager to make sure the shards
return consistent data.
2. Cross-node read-write queries:
This will require a global snapshot manager and global transaction
manager.

I agree with you that if we start supporting PREPARE and COMMIT/ROLLBACK
PREPARED
involving foreign servers that will be good improvement.

[1] https://wiki.postgresql.org/wiki/PgConf.Asia_2016_Developer_Meeting
[2] https://wiki.postgresql.org/wiki/Built-in_Sharding

I also agree to work on implementing the atomic commit across the
foreign servers and then continue to work on the more larger problem.
I think that this will be large step forward. I'm going to submit the
updated version patch to CF3.

Attached are the latest versions of the patches. The design is almost the
same as in the previous patches; I incorporated some optimisations and
updated the documentation. But the documentation and regression tests are
still not sufficient.

The 000 patch adds some new FDW APIs to achieve atomic commit involving
foreign servers using two-phase commit. If more than one foreign
server is involved in the transaction, or the transaction changes local
data and involves even one foreign server, the local node executes PREPARE
and COMMIT/ROLLBACK PREPARED on the foreign servers at commit. A large
part of this implementation is inspired by the two-phase commit code, so I
incorporated recent changes to the two-phase commit code, for example
the recovery speed improvement, into this patch.

The 001 patch makes postgres_fdw support atomic commit. If
two_phase_commit is set to 'on' for a foreign server, two-phase commit
will be used at commit.

The 002 patch adds pg_fdw_resolver, a new contrib
module providing a bgworker process that resolves in-doubt
transactions on foreign servers, if there are any.
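
The commit rule described above for the 000 patch can be written as a small predicate. This is only a sketch of the rule as worded in this mail, not code from the patch: the name needs_two_phase_commit and its parameters are hypothetical, and it assumes every involved foreign server has two-phase commit enabled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether the local node should use
 * PREPARE + COMMIT/ROLLBACK PREPARED on the foreign servers at commit.
 *
 * n_foreign_servers   - foreign servers involved in the transaction
 * modifies_local_data - whether the transaction also changed local data
 */
static bool
needs_two_phase_commit(int n_foreign_servers, bool modifies_local_data)
{
    assert(n_foreign_servers >= 0);

    /* More than one foreign server: atomicity across them needs 2PC. */
    if (n_foreign_servers > 1)
        return true;

    /* One foreign server plus local changes is also a distributed commit. */
    return n_foreign_servers == 1 && modifies_local_data;
}
```

Under this rule a transaction that touches a single foreign server and no local data commits with an ordinary one-phase COMMIT on that server, which avoids the extra round trip of the prepare phase.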

My replies may be delayed next week, but feedback and review comments are
very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_support_fdw_xact_v3.patch (text/x-diff; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 1b98c41..d4882da 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1417,6 +1417,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0c1db07..a5ddbca 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1700,5 +1700,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to the foreign server while starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</> using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be
+    using different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is greater than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used when two or
+    more foreign servers that support two-phase commit are involved in the
+    transaction, or when the transaction involves one foreign server that
+    supports two-phase commit and also changes local data. In other cases, for
+    example when only one foreign server that supports two-phase commit is
+    involved in the transaction and no local data is changed, the two-phase
+    commit protocol is not used. The two-phase commit protocol is processed in
+    two phases: the prepare phase and the commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, the prepare phase
+    fails. In the commit phase, all the prepared transactions are committed if
+    the prepare phase succeeded, or rolled back if the prepare phase failed to
+    prepare the transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>PrepareForeignTransaction</> with the same identifier to prepare
+    the transaction on the foreign server.
+    </para>
+    
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier and
+    action FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction, or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be retried
+    through the built-in <function>pg_fdw_xact</>. One may set up a background
+    worker process to retry the operation by installing the extension
+    pg_fdw_xact_resolver and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit
+    cannot be guaranteed across foreign servers. If the transaction on
+    <productname>PostgreSQL</> is committed, the distributed transaction manager
+    commits the transaction on all the foreign servers registered using
+    <function>RegisterXactForeignServer</>, independent of the outcome of the
+    same operation on other foreign servers. Thus transactions on some foreign
+    servers may be committed, while the same on other foreign servers would be
+    rolled back. If the transaction on <productname>PostgreSQL</> aborts,
+    transactions on all the foreign servers are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
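
Reviewer note on the fdw-transactions section above: the rule for when the two-phase commit protocol is used can be sketched as standalone C. This is only an illustration that mirrors the condition in PreCommit_FDWXacts() further down in this patch; the enum and function names here are hypothetical and are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical outcome names, not identifiers from the patch. */
typedef enum
{
	FDW_COMMIT_NOTHING,		/* no two-phase-capable server is left */
	FDW_COMMIT_ONE_PHASE,	/* commit directly on the single server */
	FDW_COMMIT_TWO_PHASE	/* prepare now, resolve after local commit */
} FdwCommitMode;

/*
 * n_two_phase_servers: number of registered servers that support two-phase
 * commit (servers that do not are assumed to be committed beforehand).
 * wrote_local: whether the local transaction changed local data.
 */
FdwCommitMode
choose_commit_mode(int n_two_phase_servers, bool wrote_local)
{
	assert(n_two_phase_servers >= 0);

	if (n_two_phase_servers > 1 ||
		(n_two_phase_servers == 1 && wrote_local))
		return FDW_COMMIT_TWO_PHASE;
	if (n_two_phase_servers == 1)
		return FDW_COMMIT_ONE_PHASE;
	return FDW_COMMIT_NOTHING;
}
```

With two capable servers, or one capable server plus local writes, the sketch selects two-phase commit; a single capable server with no local writes is committed directly.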
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..6e7aac7
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
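
Reviewer note on fdw_xact_desc() above: the prepared transaction identifier is stored with an explicit length, so it is printed with "%.*s" instead of being treated as a NUL-terminated string. A minimal standalone sketch of that formatting technique follows, using plain snprintf() instead of appendStringInfo(); the helper name format_xact_id is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a length-delimited identifier into out. The identifier is not
 * assumed to be NUL-terminated, so "%.*s" bounds the read to id_len bytes.
 * Returns what snprintf() returns: the length of the full formatted string,
 * excluding the terminator.
 */
int
format_xact_id(char *out, size_t out_size, const char *id, int id_len)
{
	assert(id_len >= 0);
	return snprintf(out, out_size, " foreign transaction info: %.*s",
					id_len, id);
}
```

Passing the stored length as the precision keeps the output correct even when the bytes after the identifier are not a terminator.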
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 62ed1dc..c2f36c7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..2891bca
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2226 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the oid of the foreign server and of the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to the
+ * disk and logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		get_prepare_id;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (max_fdw_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier providing function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available, so save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;		/* user mapping id for connection key */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;   /* XLOG offset of inserting this entry start */
+	XLogRecPtr		fdw_xact_end_lsn;   /* XLOG offset of inserting this entry end*/
+
+	bool			fdw_xact_valid;		/* Has the entry been completed and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool            ondisk;             /* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 hex digits of xid, 8 of
+ * foreign server oid and 8 of user oid, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+/*
+ * During replay and replication KnownFDWXactList holds info about active foreign server
+ * transactions that haven't been moved to files yet. We will need that info by the end of
+ * recovery (including promotion) to restore the memory state of those transactions.
+ *
+ * A naive approach would be to move each PREPARE record to disk, fsync it and not have
+ * that list at all, but that provokes a lot of unnecessary fsyncs on small files,
+ * causing the replica to be slower than the master.
+ *
+ * Replay of two-phase records happens by the following rules:
+ *		* On PREPARE redo KnownFDWXactAdd() is called to add that transaction to
+ *		  KnownFDWXactList and no more actions are taken.
+ *		* On checkpoint redo we iterate through KnownFDWXactList, move all prepare
+ *		  records that are behind redo_horizon to files and delete them from the list.
+ *		* On COMMIT/ABORT we delete the file or the entry in KnownFDWXactList.
+ *		* At the end of recovery we move all known foreign server transactions to disk
+ *		  to allow RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *		  to do their work.
+ */
+typedef struct KnownFDWXact
+{
+	TransactionId	local_xid;
+	Oid				serverid;
+	Oid				userid;
+	XLogRecPtr		fdw_xact_start_lsn;
+	XLogRecPtr		fdw_xact_end_lsn;
+	dlist_node		list_node;
+} KnownFDWXact;
+
+static dlist_head KnownFDWXactList = DLIST_STATIC_INIT(KnownFDWXactList);
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Transactions on foreign servers that cannot execute the two-phase commit
+ * protocol are committed right away. For these foreign servers it is possible
+ * that some transactions commit even if the local transaction aborts.
+ * Transactions on the remaining foreign servers are prepared; they will be
+ * committed or aborted after the current transaction has been committed or
+ * aborted, respectively. As an exception, when only one server that can
+ * execute the two-phase commit protocol is involved and no changes were made
+ * to local data, the two-phase commit protocol is not needed, so the
+ * transaction on that server is committed directly.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Here the foreign servers that can not execute two-phase-commit protocol
+	 * have already committed their transactions, and MyFDWConnections has only
+	 * foreign servers that can execute two-phase-commit protocol. We don't need
+	 * to use two-phase-commit protocol if there is only one such foreign server
+	 * and the transaction didn't write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * We don't need to use two-phase commit protocol when only one server
+		 * remains, even if this server can execute two-phase-commit protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Forget the connection; cur is NULL after the loop, so delete by pointer */
+		MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_id;
+		int				fdw_xact_id_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+												 fdw_conn->userid,
+												 &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (thereby relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+		/*
+		 * Between register_fdw_xact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction (say,
+		 * because of a signal). During abort processing, it will send an ABORT
+		 * message to the foreign server. If the foreign server has not prepared
+		 * the transaction, the message will succeed. If the foreign server has
+		 * prepared transaction, it will throw an error, which we will ignore and the
+		 * prepared foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the error
+			 * text from FDW and add it here in the message or as a context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; the entry is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and from
+ * disk, and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+			bool				ondisk;
+
+			/*
+			 * Fill up the log record and remember whether a file exists
+			 * before releasing the entry. Once the entry is back on the free
+			 * list and the lock is released, another backend may reuse it.
+			 */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+			ondisk = fdw_xact->ondisk;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			START_CRIT_SECTION();
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+
+			END_CRIT_SECTION();
+
+			/* Remove the file from the disk if it exists. */
+			if (ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between those two, we will be left with an
+	 * entry locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+				 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign server/s involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/*
+	 * Lock and get all the entries belonging to the given transaction id. If
+	 * the foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point in time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
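+ * get_prepared_foreign_xact_resolver
+ *
+ * Returns the ResolvePreparedForeignTransaction routine of the foreign data
+ * wrapper serving the given foreign transaction's server. Raises an error if
+ * the wrapper does not provide one.
+ */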
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. Check
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by those
+ * arguments that carry valid values for their respective datatypes.
+ *
+ * The table below lists the possible combinations.
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every entry matches, so
+ * the function returns true whenever at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked by other
+ * backends, nothing will be returned in the list but the returned value will
+ * be true.
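+ *
+ * As an illustration, the callers in this file use the criteria as follows
+ * (a summary of the existing calls, not an additional interface):
+ *
+ *		search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid,
+ *						&entries_to_resolve);
+ *			FDWXactTwoPhaseFinish(): lock and fetch the entries of the local
+ *			transaction xid, whatever the database, server or user.
+ *
+ *		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid,
+ *						InvalidOid, &entries_to_resolve);
+ *			pg_fdw_xact_resolve(): lock and fetch the entries of the current
+ *			database.
+ *
+ *		search_fdw_xact(xid, dbid, serverid, userid, NULL);
+ *			fdw_xact_exists(): only test for existence; nothing is locked.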
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		KnownFDWXactAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec        *fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		KnownFDWXactRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						   fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * The function writes the files while holding FDWXactLock, since few entries,
+ * if any, are expected to need it; see the comment inside the function. Moving
+ * the I/O out of the lock would create a race condition: after releasing the
+ * lock, and before writing a file, the corresponding foreign transaction entry
+ * might get removed.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick get-away, if there are no entries to check */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want
+ * them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction, so nothing more needs to be decided here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and no
+				 * backend is running it. It probably crashed before the
+				 * actual abort or commit, so assume it aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entry/s without
+ * resolving. The function gives a way to forget about such prepared
+ * transaction in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		/* The file was already closed above, so do not close it again. */
+		ereport(WARNING,
+				  (errmsg("corrupt foreign transaction state file \"%s\"",
+							  path)));
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	/*
+	 * Move foreign transactions from KnownFDWXactList to files, if any.
+	 * It is possible to skip that step and teach subsequent code about
+	 * KnownFDWXactList, but the whole prescan happens only once, at the end
+	 * of recovery or at promotion, so it probably isn't worth the complication.
+	 */
+	KnownFDWXactRecreateFiles(InvalidXLogRecPtr);
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* No WAL positions are known for an entry recovered from a file */
+			fdw_xact->fdw_xact_start_lsn = 0;
+			fdw_xact->fdw_xact_end_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
+
+/*
+ * KnownFDWXactAdd
+ *
+ * Store correspondence of start/end lsn and xid in KnownFDWXactList.
+ * This is called during redo of a prepare record, to keep a list of
+ * transactions prepared on foreign servers whose state has not yet been
+ * moved to files by the end of recovery.
+ */
+void
+KnownFDWXactAdd(XLogReaderState *record)
+{
+	KnownFDWXact *fdw_xact;
+	FDWXactOnDiskData *fdw_xact_data_file = (FDWXactOnDiskData *) XLogRecGetData(record);
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = (KnownFDWXact *) palloc(sizeof(KnownFDWXact));
+	fdw_xact->local_xid = fdw_xact_data_file->local_xid;
+	fdw_xact->serverid = fdw_xact_data_file->serverid;
+	fdw_xact->userid = fdw_xact_data_file->userid;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+
+	dlist_push_tail(&KnownFDWXactList, &fdw_xact->list_node);
+}
+
+/*
+ * KnownFDWXactRemove
+ *
+ * Forget about a foreign transaction. Called during commit/abort redo.
+ */
+void
+KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	dlist_mutable_iter miter;
+
+	Assert(RecoveryInProgress());
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact *fdw_xact = dlist_container(KnownFDWXact, list_node,
+												 miter.cur);
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			dlist_delete(miter.cur);
+			/*
+			 * Since we found the entry in KnownFDWXactList, we know the file
+			 * isn't on disk yet, so there is nothing more to do here.
+			 */
+			return;
+		}
+	}
+
+	/*
+	 * Here we know that the file should be removed from disk. But aborting
+	 * recovery because an unneeded file is absent doesn't seem to be a
+	 * good idea, so call remove with giveWarning = false.
+	 */
+	RemoveFDWXactFile(xid, serverid, userid, false);
+}
+
+/*
+ * KnownFDWXactRecreateFiles
+ *
+ * Moves foreign server transaction records from WAL to files. Called during
+ * checkpoint replay or from PrescanFDWXacts().
+ *
+ * redo_horizon = InvalidXLogRecPtr indicates that all transactions from
+ *		KnownFDWXactList should be moved to disk.
+ */
+void
+KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon)
+{
+	dlist_mutable_iter miter;
+	int			serialized_fdw_xacts = 0;
+
+	Assert(RecoveryInProgress());
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact   *fdw_xact = dlist_container(KnownFDWXact,
+														list_node, miter.cur);
+
+		if (fdw_xact->fdw_xact_end_lsn <= redo_horizon || redo_horizon == InvalidXLogRecPtr)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			pfree(buf);
+			dlist_delete(miter.cur);
+			serialized_fdw_xacts++;
+		}
+	}
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
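
For readers following the resolver logic in pg_fdw_xact_resolve() above, the decision it makes per entry can be sketched as a small standalone function. This is only an illustration, not code from the patch: the Sketch* types and sketch_decide() are hypothetical names, and the four local states stand in for the results of TransactionIdDidCommit(), TransactionIdDidAbort() and TransactionIdIsInProgress().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the FDW_XACT_* status values used in the patch. */
typedef enum
{
	SK_PREPARING,
	SK_COMMITTING_PREPARED,
	SK_ABORTING_PREPARED
} SketchStatus;

/* Hypothetical summary of what is known about the local transaction. */
typedef enum
{
	SK_LOCAL_COMMITTED,
	SK_LOCAL_ABORTED,
	SK_LOCAL_IN_PROGRESS,
	SK_LOCAL_UNKNOWN			/* not committed, not aborted, not running */
} SketchLocalState;

/*
 * Decide how a prepared foreign transaction should be resolved.
 * Returns true and stores the status to act on in *next when the entry can
 * be resolved now; returns false when the local transaction is still in
 * progress and the entry must be left alone.
 */
static bool
sketch_decide(SketchStatus current, SketchLocalState local, SketchStatus *next)
{
	/* A decision taken earlier is never revisited. */
	if (current == SK_COMMITTING_PREPARED || current == SK_ABORTING_PREPARED)
	{
		*next = current;
		return true;
	}

	switch (local)
	{
		case SK_LOCAL_COMMITTED:
			*next = SK_COMMITTING_PREPARED;
			return true;
		case SK_LOCAL_ABORTED:
		case SK_LOCAL_UNKNOWN:	/* presumably crashed: treat as aborted */
			*next = SK_ABORTING_PREPARED;
			return true;
		case SK_LOCAL_IN_PROGRESS:
			return false;
	}
	return false;
}
```

The ordering matters: an entry already marked committing or aborting keeps that status regardless of what the local transaction status lookup says.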
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..ad71c0e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5415604..734ed48 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d643216..e9a9919 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -114,6 +115,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -187,6 +190,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction,
+										   Only valid for top level transaction */
+	int			can_prepare;			/* can all the foreign server involved in
+										   this transaction participate in 2PC */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1918,6 +1925,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1978,6 +1988,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2135,6 +2148,7 @@ CommitTransaction(void)
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2156,6 +2170,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2319,6 +2335,7 @@ PrepareTransaction(void)
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
 	AtPrepare_RelationMap();
+	AtPrepare_FDWXacts();
 
 	/*
 	 * Here is where we really truly prepare.
@@ -2427,6 +2444,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2608,9 +2627,12 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4294,6 +4316,32 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	/* Quick exits if no need to remember */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	/* Quick exits if no need to forget */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index aa9ee5a..105e1f6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4905,6 +4906,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -5972,6 +5974,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6658,7 +6663,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7274,6 +7282,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7459,6 +7468,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8736,6 +8748,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9170,7 +9187,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9191,6 +9209,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9206,6 +9225,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9394,6 +9414,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9436,6 +9457,7 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
+		KnownFDWXactRecreateFiles(checkPoint.redo);
 		RecoveryRestartPoint(&checkPoint);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
@@ -9586,6 +9608,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
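
PrescanFDWXacts() and RecoverFDWXactFromFiles(), which the xlog.c hunks above call, recognise state files purely by name. A standalone sketch of that name check follows. It is an illustration under assumptions: the sketch_* names are hypothetical, and the 26-character length and upper-case "%08X_%08X_%08X" layout are inferred from the strspn()/sscanf() calls in the patch, since FDW_XACT_FILE_NAME_LEN and FDWXactFilePath() are not shown in this part. Unlike the patch, the sketch also checks the sscanf() return value, so a name made only of underscores is rejected.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: three 8-digit upper-case hex fields joined by '_'. */
#define SKETCH_NAME_LEN 26

/* Format xid, serverid and userid into buf (needs SKETCH_NAME_LEN + 1 bytes). */
static void
sketch_format_name(char *buf, size_t buflen,
				   uint32_t xid, uint32_t serverid, uint32_t userid)
{
	snprintf(buf, buflen, "%08X_%08X_%08X",
			 (unsigned int) xid, (unsigned int) serverid,
			 (unsigned int) userid);
}

/* Return 1 and fill the outputs if name has the expected shape, else 0. */
static int
sketch_parse_name(const char *name,
				  uint32_t *xid, uint32_t *serverid, uint32_t *userid)
{
	unsigned int x;
	unsigned int s;
	unsigned int u;

	/* Same length and character-set test as in the patch. */
	if (strlen(name) != SKETCH_NAME_LEN ||
		strspn(name, "0123456789ABCDEF_") != SKETCH_NAME_LEN)
		return 0;

	/* All three fields must convert, otherwise the name is not ours. */
	if (sscanf(name, "%8x_%8x_%8x", &x, &s, &u) != 3)
		return 0;

	*xid = x;
	*serverid = s;
	*userid = u;
	return 1;
}
```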
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5c5ba7b..e784f88 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 649cef8..9b78860 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -277,6 +277,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index eb531af..9a10696 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1087,6 +1088,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1385,6 +1400,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index a954610..8e0b2ed 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -423,6 +423,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -687,6 +690,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -982,6 +988,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 46cd5ba..c0f000c 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 29febb4..af222da 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -148,6 +149,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +269,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index dd04182..0285f5b 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -48,3 +48,4 @@ ReplicationOriginLock				40
 MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
 BackendRandomLock					43
+FDWXactLock					44
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a025117..8a3b9be 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -2061,6 +2062,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 2c638b2..78b8561 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index adcebe2..5b403f6 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 24f9cc8..9fc424a 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -204,6 +204,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3370966 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -301,5 +301,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 2b76f64..dda2d7a 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..d6ff550 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..a556280
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,79 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void KnownFDWXactAdd(XLogReaderState *record);
+extern void KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index a7a0ae2..86448ff 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,6 +44,7 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index a123d2a..76bbbfd 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
@@ -356,6 +359,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index ceb0462..2990e05 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 0bc41ab..3413201 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index cd7b909..b44f781 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5262,6 +5262,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4111 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..74cf69f 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 0344f42..3798321 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -254,11 +254,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 7ed1623..09b8269 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1332,4 +1332,8 @@ extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index f2dedbb..8c65562 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2273,7 +2275,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

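For readers following the new FdwRoutine callbacks added in fdwapi.h above, here is a minimal standalone sketch of the commit ordering they are meant to support: prepare on every foreign server first, then resolve. It is not code from the patch. The Participant struct, the demo_* helpers and atomic_commit() are hypothetical names invented for illustration, and the callbacks are simplified models of PrepareForeignTransaction, ResolvePreparedForeignTransaction and EndForeignTransaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct Participant
{
	/* models PrepareForeignTransaction */
	bool		(*prepare) (struct Participant *self);
	/* models ResolvePreparedForeignTransaction */
	bool		(*resolve) (struct Participant *self, bool is_commit);
	bool		prepare_ok;		/* knob: should the prepare step succeed? */
	bool		prepared;
	bool		committed;
	bool		rolled_back;
} Participant;

static bool
demo_prepare(Participant *p)
{
	p->prepared = p->prepare_ok;
	return p->prepared;
}

static bool
demo_resolve(Participant *p, bool is_commit)
{
	if (!p->prepared)
		return false;			/* nothing prepared, nothing to resolve */
	p->prepared = false;
	if (is_commit)
		p->committed = true;
	else
		p->rolled_back = true;
	return true;
}

/*
 * Phase 1: prepare on every participant, stopping at the first failure.
 * Phase 2: commit the prepared ones if all prepared, else roll them back.
 * Participants that never reached the prepared state are simply marked
 * rolled back, modelling EndForeignTransaction(..., is_commit = false).
 * Returns true if everything committed.
 */
static bool
atomic_commit(Participant *parts, size_t n)
{
	size_t		i;
	size_t		nprepared = 0;
	bool		all_prepared = true;

	assert(parts != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		if (!parts[i].prepare(&parts[i]))
		{
			all_prepared = false;
			break;
		}
		nprepared++;
	}

	for (i = 0; i < nprepared; i++)
		parts[i].resolve(&parts[i], all_prepared);

	for (i = nprepared; i < n; i++)
		parts[i].rolled_back = true;

	return all_prepared;
}
```

With two participants that both prepare successfully, atomic_commit() returns true and both end up committed. If the second one fails to prepare, the first is rolled back through resolve() and nothing is committed anywhere.
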
Attachment: 001_pgfdw_support_atomic_commit_v3.patch (text/x-diff; charset=US-ASCII)
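
The postgres_fdw patch in this attachment crafts a prepared transaction identifier and sends it to the foreign server with a length-bounded "%.*s" format. The standalone sketch below shows that idea using only the C library. It is an illustration under assumptions, not the patch's code: craft_prep_id() and build_prepare_command() are hypothetical helpers, the nonce is passed in by the caller rather than taken from random(), and the id layout "px_<nonce>_<serverid>_<userid>" is simplified from the patch's format string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same limit the patch borrows from twophase.c */
#define PREP_XACT_ID_MAX_LEN 200

/*
 * Write "px_<nonce>_<serverid>_<userid>" into buf.
 * Returns the id length excluding the terminating NUL, or -1 if it
 * does not fit in buflen bytes.
 */
static int
craft_prep_id(char *buf, size_t buflen, unsigned int nonce,
			  unsigned int serverid, unsigned int userid)
{
	int			len;

	assert(buf != NULL && buflen > 0);
	len = snprintf(buf, buflen, "px_%u_%u_%u", nonce, serverid, userid);
	if (len < 0 || (size_t) len >= buflen)
		return -1;
	return len;
}

/*
 * Build the PREPARE TRANSACTION command for an id given as pointer plus
 * length. "%.*s" copies at most id_len bytes, so the id does not have to
 * be NUL-terminated at exactly that position.
 * Returns the command length, or -1 if it does not fit.
 */
static int
build_prepare_command(char *buf, size_t buflen, const char *id, int id_len)
{
	int			len;

	assert(buf != NULL && id != NULL && id_len >= 0);
	len = snprintf(buf, buflen, "PREPARE TRANSACTION '%.*s'", id_len, id);
	if (len < 0 || (size_t) len >= buflen)
		return -1;
	return len;
}
```

The sketch does no quoting of the id, which is only acceptable because the id here is built from a fixed prefix and integers. An id taken from anywhere else would need proper literal escaping before being placed in the command.
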

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 5fabc99..d89ead6 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * If connection_error_ok is true, the caller handles a connection failure
+ * itself and NULL is returned; if false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -235,11 +262,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"", server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
@@ -370,15 +400,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and record whether it can take
+		 * part in two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -586,158 +623,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * This function crafts a prepared transaction identifier. The PostgreSQL
+ * documentation places two restrictions on the identifier:
+ * 1. It must be a string literal, less than 200 bytes long.
+ * 2. It must not match the id of any other currently prepared transaction.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * This function prepares the transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * This function commits or aborts a prepared transaction on the foreign server.
+ * This function could be called when we don't have any connections to the
+ * foreign server involving distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to the
+	 * end-of-transaction callback.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -836,3 +999,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 785f520..a9fb3f7 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -6972,3 +7000,176 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'of');
+ERROR:  two_phase_commit requires a Boolean value
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw')                                                                                                                                                                                                                                                                                                                                                         | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one not supporting server
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One not supporting server and one supporting server
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server and one not supporting server.
+BEGIN;
+INSERT INTO ft7 VALUES(104);
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   106
+(1 row)
+
+-- one local and one not supporting foreign server
+BEGIN;
+INSERT INTO ft7 VALUES(107);
+INSERT INTO "S 1"."T 6" VALUES (1);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   107
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- one local and one supporting foreign server and not supporting one
+BEGIN;
+INSERT INTO ft7 VALUES(108);
+INSERT INTO ft8 VALUES(109);
+INSERT INTO "S 1"."T 6" VALUES (2);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   109
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     2
+(1 row)
+
+-- one local and two supporting foreign server and not supporting one
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   112
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     3
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   112
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   112
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     3
+(1 row)
+
+-- violation on foreign server supporting 2PC
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   112
+(1 row)
+
+-- transaction involving local and foreign server with violation on local server
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   112
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     3
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 224aed9..6a20c47 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -176,6 +177,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fbe6929..a398498 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -465,6 +467,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1321,7 +1329,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1698,7 +1706,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2293,7 +2301,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2555,7 +2563,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3492,7 +3500,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3582,7 +3590,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3805,7 +3813,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..8409671 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..4ef2e51 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1660,3 +1690,95 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one not supporting server
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One not supporting server and one supporting server
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server and one not supporting server.
+BEGIN;
+INSERT INTO ft7 VALUES(104);
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- one local and one not supporting foreign server
+BEGIN;
+INSERT INTO ft7 VALUES(107);
+INSERT INTO "S 1"."T 6" VALUES (1);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- one local and one supporting foreign server and not supporting one
+BEGIN;
+INSERT INTO ft7 VALUES(108);
+INSERT INTO ft8 VALUES(109);
+INSERT INTO "S 1"."T 6" VALUES (2);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- one local and two supporting foreign server and not supporting one
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involving local and foreign server with violation on local server
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
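
For readers following the connection.c hunk above: the rule it applies when resolving a prepared transaction is that a failed COMMIT PREPARED or ROLLBACK PREPARED still counts as resolved if the remote error is undefined_object (SQLSTATE 42704), since the prepared transaction was presumably finished by some other means. Below is a minimal standalone sketch of that decision, not the patch code itself. The names prefixed with sketch_/SKETCH_ are hypothetical, the six-bit packing mirrors what PostgreSQL's MAKE_SQLSTATE does, and the length check on the SQLSTATE string is an extra guard assumed here for safety.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Pack a five-character SQLSTATE into an int, six bits per character. */
#define SKETCH_SIXBIT(ch)	(((ch) - '0') & 0x3F)
#define SKETCH_MAKE_SQLSTATE(c1, c2, c3, c4, c5) \
	(SKETCH_SIXBIT(c1) + (SKETCH_SIXBIT(c2) << 6) + \
	 (SKETCH_SIXBIT(c3) << 12) + (SKETCH_SIXBIT(c4) << 18) + \
	 (SKETCH_SIXBIT(c5) << 24))

/* 42704 = undefined_object, 08006 = connection_failure */
#define SKETCH_UNDEFINED_OBJECT \
	SKETCH_MAKE_SQLSTATE('4', '2', '7', '0', '4')
#define SKETCH_CONNECTION_FAILURE \
	SKETCH_MAKE_SQLSTATE('0', '8', '0', '0', '6')

/*
 * command_ok:    true if COMMIT PREPARED / ROLLBACK PREPARED succeeded.
 * diag_sqlstate: SQLSTATE reported by the foreign server, or NULL if
 *                none was available (treated as a connection failure).
 *
 * Returns true if the prepared transaction can be considered resolved.
 */
bool
sketch_prepared_xact_resolved(bool command_ok, const char *diag_sqlstate)
{
	int			sqlstate;

	if (command_ok)
		return true;

	/* Assumed guard: only decode a well-formed five-character code. */
	if (diag_sqlstate != NULL && strlen(diag_sqlstate) == 5)
		sqlstate = SKETCH_MAKE_SQLSTATE(diag_sqlstate[0],
										diag_sqlstate[1],
										diag_sqlstate[2],
										diag_sqlstate[3],
										diag_sqlstate[4]);
	else
		sqlstate = SKETCH_CONNECTION_FAILURE;

	/* A missing prepared transaction was resolved by someone else. */
	return sqlstate == SKETCH_UNDEFINED_OBJECT;
}
```

Any other failure, including a lost connection, returns false, so the caller would keep the foreign transaction entry around and retry later.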

Attachment: 002_pg_fdw_resolver_contrib_v3.patch (text/x-diff; charset=US-ASCII)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..100f8fe
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,365 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How often the resolver daemon checks for unresolved transactions */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL;
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}

#105

Masahiko Sawada

sawada.mshk@gmail.com

about 9 years ago

In reply to: Masahiko Sawada (#104)

Re: Transactions involving multiple postgres foreign servers

On Fri, Dec 23, 2016 at 1:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 9, 2016 at 4:02 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Dec 9, 2016 at 3:02 PM, vinayak <Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

On 2016/12/05 14:42, Ashutosh Bapat wrote:

On Mon, Dec 5, 2016 at 11:04 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Fri, Nov 11, 2016 at 5:38 PM, Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

2PC is a basic building block for atomic commit, and there are
optimizations that reduce its cost. As you mentioned, it's hard to
support a single model that would suit several types of FDWs. But
sharding aside, many other databases that can be connected to
PostgreSQL via FDW support 2PC, so 2PC for FDW would be useful beyond
sharding. That's why I have been focusing on implementing 2PC for FDW
so far.

Moved to next CF with "needs review" status.

I think this should be changed to "returned with feedback.". The
design and approach itself needs to be discussed. I think, we should
let authors decide whether they want it to be added to the next
commitfest or not.

When I first started with this work, Tom had suggested me to try to
make PREPARE and COMMIT/ROLLBACK PREPARED involving foreign servers or
at least postgres_fdw servers work. I think, most of my work that
Vinayak and Sawada have rebased to the latest master will be required
for getting what Tom suggested done. We wouldn't need a lot of changes
to that design. PREPARE involving foreign servers errors out right
now. If we start supporting prepared transactions involving foreign
servers that will be a good improvement over the current status-quo.
Once we get that done, we can continue working on the larger problem
of supporting ACID transactions involving foreign servers.

At the PGConf.Asia developers meeting, Bruce Momjian and other
developers discussed FDW-based sharding [1]. The suggestion from other
hackers was that we need to discuss the big picture and use cases of
sharding. Bruce has listed all the building blocks of built-in
sharding on the wiki [2]. IIUC, a transaction manager involving
foreign servers is one part of sharding.

Yeah, 2PC on FDW is a basic building block for FDW-based sharding, and
it would be useful not only for FDW sharding but also for other
purposes. As far as I can tell from the papers I surveyed, many kinds
of distributed transaction management architectures use 2PC for atomic
commit, with some optimisations. Using 2PC to provide atomic commit
for distributed transactions also fits well with the current
PostgreSQL implementation.

As per Bruce's wiki page, there are two use cases for transactions
involving multiple foreign servers:
1. Cross-node read-only queries on read/write shards:
This will require a global snapshot manager to make sure the shards
return consistent data.
2. Cross-node read-write queries:
This will require a global snapshot manager and global transaction
manager.

I agree with you that if we start supporting PREPARE and
COMMIT/ROLLBACK PREPARED involving foreign servers, that will be a
good improvement.

[1] https://wiki.postgresql.org/wiki/PgConf.Asia_2016_Developer_Meeting
[2] https://wiki.postgresql.org/wiki/Built-in_Sharding

I also agree to work on implementing atomic commit across foreign
servers first and then continue to work on the larger problem. I think
this will be a large step forward. I'm going to submit the updated
version of the patch to CF3.

Attached is the latest version of the patches. The design is almost
the same as in the previous patches; I incorporated some optimisations
and updated the documentation. The documentation and regression tests
are still not sufficient, though.

The 000 patch adds some new FDW APIs to achieve atomic commit
involving foreign servers using two-phase commit. If more than one
foreign server is involved in the transaction, or the transaction
changes local data and involves even one foreign server, the local
node executes PREPARE and COMMIT/ROLLBACK PREPARED on the foreign
servers at commit. Much of this implementation is inspired by the
two-phase commit code, so I incorporated recent changes to that code,
for example the recovery speed improvement, into this patch.
The 001 patch makes postgres_fdw support atomic commit. If
two_phase_commit is set to 'on' for a foreign server, two-phase commit
will be used at commit. The 002 patch adds pg_fdw_resolver, a new
contrib module providing a bgworker process that resolves in-doubt
transactions on foreign servers, if there are any.

My reply might be delayed until late next week, but feedback and
review comments are very welcome.

A long time has passed since Ashutosh proposed the original patch, so
let me explain the current design and functionality of this feature
again. If you have any questions, please feel free to ask.

Parameters
==========
The patch introduces the max_prepared_foreign_transactions parameter
and the two_phase_commit parameter.

two_phase_commit is a new foreign server option; it indicates that the
specified foreign server is capable of the two-phase commit protocol.
A transaction that modifies data can be committed using the two-phase
commit protocol on a foreign server with two_phase_commit = on. We can
set this option with the CREATE/ALTER SERVER command.

max_prepared_foreign_transactions is a new GUC parameter that sets the
upper bound on the number of transactions that local transactions can
prepare on foreign servers. Note that it does not limit the number of
local transactions that involve foreign servers. Since one transaction
can prepare transactions on multiple foreign servers,
max_prepared_foreign_transactions should be set to at least
((max_connections) * (the number of foreign servers with
two_phase_commit = on)). Changing this parameter requires a restart.
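
The sizing rule above is plain arithmetic. As a minimal sketch (the
helper name min_prepared_foreign_xacts is hypothetical and not part of
the patch), the lower bound for the GUC could be computed like this:

```c
/*
 * Hypothetical helper, not part of the patch: lower bound for the GUC,
 * i.e. max_connections multiplied by the number of foreign servers
 * that have two_phase_commit = on. Returns -1 on invalid input.
 */
long long
min_prepared_foreign_xacts(int max_connections, int n_two_phase_servers)
{
	if (max_connections < 0 || n_two_phase_servers < 0)
		return -1;

	/* Widen before multiplying so the product cannot overflow an int. */
	return (long long) max_connections * (long long) n_two_phase_servers;
}
```

For example, max_connections = 100 with three such foreign servers
gives a lower bound of 300.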

Cluster-wide atomic commit
=======================
Since commits on foreign servers are executed independently, a
transaction that modifies data on multiple foreign servers is not
guaranteed to either commit on all of them or roll back on all of
them. The patch adds functionality that guarantees a distributed
transaction either commits or rolls back on all foreign servers. IOW,
the goal of this patch is to achieve cluster-wide atomic commit across
foreign servers that are capable of the two-phase commit protocol. If
a transaction modifies data on multiple foreign servers and then does
COMMIT, the transaction is implicitly committed or rolled back on the
foreign servers using the two-phase commit protocol.

A transaction is committed or rolled back using the two-phase commit
protocol in the following cases.
* The transaction changes local data and involves at least one foreign
server whose two_phase_commit is on.
* The transaction changes data on more than one foreign server whose
two_phase_commit is on.
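
Read together with the description of the 000 patch earlier in this
mail, the two cases could be expressed as a single predicate. This is
only a sketch of my reading of the rule; the function name and its
arguments are hypothetical and do not appear in the patch:

```c
#include <stdbool.h>

/*
 * Hypothetical predicate, not part of the patch.
 *
 * modified_local_data: the transaction changed data on the local node.
 * n_two_phase_servers: number of involved foreign servers that have
 *                      two_phase_commit = on.
 */
bool
needs_two_phase_commit(bool modified_local_data, int n_two_phase_servers)
{
	/* No 2PC-capable foreign server is involved at all. */
	if (n_two_phase_servers <= 0)
		return false;

	/* More than one 2PC-capable foreign server: always use 2PC. */
	if (n_two_phase_servers > 1)
		return true;

	/* Exactly one such server: 2PC only if local data changed too. */
	return modified_local_data;
}
```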

In order to manage foreign transactions, the patch changes PostgreSQL
core so that it keeps track of them. These entries live in shared
memory but are written to an fdw_xact file in the $PGDATA/fdw_xact
directory at checkpoint. We can check all foreign transaction entries
via the pg_fdw_xacts system view.

The commit of a distributed transaction using the two-phase commit
protocol is executed as follows.

In the 1st phase, every foreign server with two_phase_commit = on
needs to register its connection in MyFDWConnection when starting a
new transaction on a foreign connection, using
RegisterXactForeignServer(). During the pre-commit phase, the
following steps are executed.

1. Get transaction identifier used for PREPARE TRANSACTION on foreign servers.
2. Execute COMMIT on foreign server with two_phase_commit = off.
3. Register fdw_xact entry into shared memory and write
XLOG_FDW_XACT_INSERT WAL.
4. Execute PREPARE TRANSACTION on foreign server with two_phase_commit = on.

After that, the local changes are committed (by calling
RecordTransactionCommit()). Between phase 1 and the local commit, the
transaction could fail, for example due to a serialization failure or
in the pre-commit processing of NOTIFY. In such a case, all foreign
transactions are rolled back.
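
The ordering of steps 2 to 4 above can be sketched as a small planner
that lists the actions per foreign server. All names here are
hypothetical, and doing steps 3 and 4 back to back for each server
(rather than step 3 for all servers first) is an assumption of this
sketch, not something stated above:

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	ACT_COMMIT,				/* step 2: plain COMMIT, server without 2PC */
	ACT_REGISTER_ENTRY,		/* step 3: fdw_xact entry plus WAL record */
	ACT_PREPARE				/* step 4: PREPARE TRANSACTION */
} PrecommitAction;

typedef struct
{
	size_t		server;		/* index into the caller's server array */
	PrecommitAction action;
} PrecommitStep;

/*
 * Fill "out" with the pre-commit actions in execution order and return
 * how many were written. "out" must have room for 2 * nservers entries.
 */
size_t
plan_precommit(const bool *two_phase, size_t nservers, PrecommitStep *out)
{
	size_t		n = 0;
	size_t		i;

	/* Servers with two_phase_commit = off are committed first. */
	for (i = 0; i < nservers; i++)
	{
		if (!two_phase[i])
		{
			out[n].server = i;
			out[n].action = ACT_COMMIT;
			n++;
		}
	}

	/* Then each 2PC server gets its entry registered and is prepared. */
	for (i = 0; i < nservers; i++)
	{
		if (two_phase[i])
		{
			out[n].server = i;
			out[n].action = ACT_REGISTER_ENTRY;
			n++;
			out[n].server = i;
			out[n].action = ACT_PREPARE;
			n++;
		}
	}

	return n;
}
```

Registering the entry before sending PREPARE TRANSACTION means that a
crash in between leaves a record of a possibly prepared foreign
transaction instead of an untracked one.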

In the 2nd phase, foreign transactions on foreign servers with
two_phase_commit = off have already finished in the 1st phase, so we
focus only on the foreign servers with two_phase_commit = on. During
the commit phase, the following steps are executed.

1. Resolve foreign prepared transaction.
2. Remove foreign transaction entry and write XLOG_FDW_XACT_REMOVE WAL.

If the server crashes after step 1 and before step 2, an already
resolved foreign transaction will be considered unresolved when the
local server recovers or a standby takes over as master. It will try
to resolve the prepared transaction again and should get an error from
the foreign server.
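
One way a resolver could cope with that retry is to treat "prepared
transaction does not exist" as already resolved. The sketch below
assumes the foreign server is PostgreSQL, which reports SQLSTATE 42704
(undefined_object) for COMMIT PREPARED or ROLLBACK PREPARED on an
unknown identifier. The function name is hypothetical, and treating
this error as success is an assumption of the sketch, not necessarily
what the patch does:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* SQLSTATE for undefined_object in PostgreSQL's error code table. */
#define SQLSTATE_UNDEFINED_OBJECT "42704"

/*
 * Hypothetical helper: decide whether a resolution attempt can be
 * considered finished. "sqlstate" is NULL when the command succeeded,
 * otherwise the five-character SQLSTATE reported by the foreign server.
 */
bool
resolution_is_done(const char *sqlstate)
{
	/* The command succeeded. */
	if (sqlstate == NULL)
		return true;

	/* The prepared transaction is gone, so nothing is left to do. */
	return strcmp(sqlstate, SQLSTATE_UNDEFINED_OBJECT) == 0;
}
```

Any other error, such as a connection failure, leaves the entry in
place so that it can be retried later.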

Crash recovery
=============
During crash recovery, fdw_xact entries are inserted into
KnownFDWXactList or removed from it when the corresponding WAL records
are replayed. After redo is done, the fdw_xact files are re-created,
and then the pg_fdw_xact directory is scanned for unresolved foreign
prepared transactions.

The files in this directory are named after the triplet (xid, foreign
server oid, user oid) to create a unique name for each file. This scan
also determines the oldest transaction id with an unresolved prepared
foreign transaction. This affects the oldest active transaction id,
since the status of this transaction id is required to decide the fate
of the unresolved prepared foreign transaction. On a standby, during
WAL replay, files are just inserted or removed. If the standby is
required to finish recovery and take over as master, pg_fdw_xact is
scanned to read unresolved foreign prepared transactions into shared
memory.
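
As an illustration of the triplet-based naming, here is a minimal
sketch. The layout (three 8-digit hexadecimal fields joined by
underscores) and the function name are assumptions made for the
example; the exact format used by the patch may differ:

```c
#include <stdio.h>

/* Assumed layout: XXXXXXXX_XXXXXXXX_XXXXXXXX (26 characters). */
#define FDW_XACT_NAME_LEN (8 + 1 + 8 + 1 + 8)

/*
 * Hypothetical helper: build a file name from (xid, server oid,
 * user oid). Returns 0 on success, -1 if the buffer is too small.
 */
int
fdw_xact_file_name(char *buf, size_t buflen,
				   unsigned int xid, unsigned int serverid,
				   unsigned int userid)
{
	int			n;

	n = snprintf(buf, buflen, "%08X_%08X_%08X", xid, serverid, userid);

	/* snprintf reports the full length even when it had to truncate. */
	return (n == FDW_XACT_NAME_LEN && (size_t) n < buflen) ? 0 : -1;
}
```

Because the three fields together identify one foreign transaction,
two different entries can never map to the same file name.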

Much of the fdw_xact.c code is inspired by the twophase.c code, so the
recovery mechanism and process are almost the same as for two-phase
commit. The patch incorporates the recent optimization of twophase.c.

Handling in-doubt transaction
========================
Any crash or connection failure in phase 2 leaves the prepared
transaction in an unresolved state (called an in-doubt transaction).
We need to resolve the in-doubt transaction after the foreign server
has recovered. We can do that manually by calling the
pg_fdw_xact_resolve function on the local server, but the patch
introduces a new contrib module, pg_fdw_resolver, to handle them
automatically. pg_fdw_resolver is a background worker process that
periodically checks whether there are in-doubt transactions and tries
to resolve them.

FDW APIs
======
The patch introduces four new FDW APIs: GetPrepareId,
EndForeignTransaction, PrepareForeignTransaction and
ResolvePrepareForeignTransaction.

* GetPrepareId is called to get the transaction identifier in the
pre-commit phase.
* EndForeignTransaction is called in the commit phase and executes
either COMMIT or ROLLBACK on the foreign server.
* PrepareForeignTransaction is called in the pre-commit phase and
executes PREPARE TRANSACTION on the foreign server.
* ResolvePrepareForeignTransaction is called in the commit phase and
executes either COMMIT PREPARED or ROLLBACK PREPARED with the given
transaction identifier on the foreign server.

If the foreign server is not capable of two-phase commit, the last two
APIs are not required.
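
To make the last API in the list above concrete, here is a minimal
sketch of how an FDW for a PostgreSQL foreign server might build the
command it sends. The function name and signature are hypothetical and
not taken from the patch; identifiers that would need quoting are
simply rejected to keep the example short:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "COMMIT PREPARED 'gid'" or
 * "ROLLBACK PREPARED 'gid'" into buf. Returns 0 on success, -1 if the
 * identifier is unusable or the buffer is too small.
 */
int
build_resolve_command(char *buf, size_t buflen, const char *gid, bool commit)
{
	int			n;

	/* Reject empty identifiers and ones that would need escaping. */
	if (gid[0] == '\0' ||
		strchr(gid, '\'') != NULL ||
		strchr(gid, '\\') != NULL)
		return -1;

	n = snprintf(buf, buflen, "%s PREPARED '%s'",
				 commit ? "COMMIT" : "ROLLBACK", gid);

	return (n > 0 && (size_t) n < buflen) ? 0 : -1;
}
```

The identifier passed in would be the one obtained in the pre-commit
phase, so the same string is used for PREPARE TRANSACTION and for the
later resolution.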

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#106

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 9 years ago

In reply to: Masahiko Sawada (#105)

Re: Transactions involving multiple postgres foreign servers

Long time passed since original patch proposed by Ashutosh, so I
explain again about current design and functionality of this feature.
If you have any question, please feel free to ask.

Thanks for the summary.

Parameters
==========

[ snip ]

Cluster-wide atomic commit
=======================
Since the distributed transaction commit on foreign servers are
executed independently, the transaction that modified data on the
multiple foreign servers is not ensured that transaction did either
all of them commit or all of them rollback. The patch adds the
functionality that guarantees distributed transaction did either
commit or rollback on all foreign servers. IOW the goal of this patch
is achieving the cluster-wide atomic commit across foreign server that
is capable two phase commit protocol.

In [1], I proposed that we solve the problem of supporting PREPARED
transactions involving foreign servers, and in a subsequent mail
Vinayak agreed to that. But this goal has a wider scope than that
proposal. I am fine with widening the scope, but then it would again
lead to the same discussion we had about the big picture. Maybe you
want to share the design (or point out the parts of this design that
will help) for solving the smaller problem and tone down the patch
accordingly.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#107

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Ashutosh Bapat (#106)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jan 13, 2017 at 3:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Long time passed since original patch proposed by Ashutosh, so I
explain again about current design and functionality of this feature.
If you have any question, please feel free to ask.

Thanks for the summary.

Parameters
==========

[ snip ]

Cluster-wide atomic commit
=======================
Since the distributed transaction commit on foreign servers are
executed independently, the transaction that modified data on the
multiple foreign servers is not ensured that transaction did either
all of them commit or all of them rollback. The patch adds the
functionality that guarantees distributed transaction did either
commit or rollback on all foreign servers. IOW the goal of this patch
is achieving the cluster-wide atomic commit across foreign server that
is capable two phase commit protocol.

In [1], I proposed that we solve the problem of supporting PREPARED
transactions involving foreign servers and in subsequent mail Vinayak
agreed to that. But this goal has wider scope than that proposal. I am
fine widening the scope, but then it would again lead to the same
discussion we had about the big picture. May be you want to share
design (or point out the parts of this design that will help) for
solving smaller problem and tone down the patch for the same.

Sorry for confusing you. I'm still focusing on solving only that
problem. What I was trying to say is that supporting PREPARED
transactions involving foreign servers is the means, not the end. So
once we support PREPARED transactions involving foreign servers, we
can achieve cluster-wide atomic commit in a sense.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#108

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#107)

4 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jan 13, 2017 at 3:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 13, 2017 at 3:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

Long time passed since original patch proposed by Ashutosh, so I
explain again about current design and functionality of this feature.
If you have any question, please feel free to ask.

Thanks for the summary.

Parameters
==========

[ snip ]

Cluster-wide atomic commit
=======================
Since the distributed transaction commit on foreign servers are
executed independently, the transaction that modified data on the
multiple foreign servers is not ensured that transaction did either
all of them commit or all of them rollback. The patch adds the
functionality that guarantees distributed transaction did either
commit or rollback on all foreign servers. IOW the goal of this patch
is achieving the cluster-wide atomic commit across foreign server that
is capable two phase commit protocol.

In [1], I proposed that we solve the problem of supporting PREPARED
transactions involving foreign servers and in subsequent mail Vinayak
agreed to that. But this goal has wider scope than that proposal. I am
fine widening the scope, but then it would again lead to the same
discussion we had about the big picture. May be you want to share
design (or point out the parts of this design that will help) for
solving smaller problem and tone down the patch for the same.

Sorry for confuse you. I'm still focusing on solving only that
problem. What I was trying to say is that I think that supporting
PREPARED transaction involving foreign server is the means, not the
end. So once we supports PREPARED transaction involving foreign
servers we can achieve cluster-wide atomic commit in a sense.

Attached are the updated patches. I fixed some bugs and added the 003
patch, which adds TAP tests for foreign transactions.
The 003 patch depends on the 000 and 001 patches.

Please give me feedback.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_support_fdw_xact_v4.patch

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 07afa3c..638f910 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1431,6 +1431,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0c1db07..a5ddbca 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1700,5 +1700,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to the foreign server while starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</> using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such transaction is as follows
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, possibly using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is more than zero
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the foreign servers registered should
+    support the two-phase commit protocol. The two-phase commit protocol is used
+    to achieve an atomic distributed transaction when two or more foreign servers
+    that support the two-phase commit protocol are involved in the transaction, or
+    when the transaction involves one foreign server that supports two-phase commit
+    and also changes local data. In other cases, for example where only one
+    foreign server that supports two-phase commit is involved in the transaction,
+    the two-phase commit protocol is not used. The two-phase commit protocol is
+    processed in two phases: prepare phase and commit phase. In prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, prepare phase fails. In commit
+    phase, all the prepared transactions are committed if prepare phase has succeeded
+    or rolled back if prepare phase fails to prepare transactions on all the foreign
+    servers.
+    </para>
+
+    <para>
+    During prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier
+    with action FDW_XACT_RESOLVED.
+    </para>
+    
+    <para>
+    During commit phase the distributed transaction manager calls
+    <function>ResolveForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to rollback the prepared transaction. In case the
+    distributed transaction manager fails to commit or rollback a prepared
+    transaction because of connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit can
+    not be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while the same
+    on other foreign servers would be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..5c35bd1
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
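As a reading aid for fdw_xact_identify() above: the record type lives in the high four bits of the info byte, and the low four bits (XLR_INFO_MASK, 0x0F in PostgreSQL) are masked off before the switch. The sketch below is standalone and only mirrors that masking. The SKETCH_* constants and sketch_fdw_xact_identify are hypothetical names, and the values 0x00 and 0x10 for the two record types are assumptions, since access/fdw_xact.h is not shown in this part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_XLR_INFO_MASK	0x0F	/* low bits reserved for XLogInsert flags */
#define SKETCH_FDW_XACT_INSERT	0x00	/* assumed value, not taken from the patch */
#define SKETCH_FDW_XACT_REMOVE	0x10	/* assumed value, not taken from the patch */

/* Map an info byte to a record name, ignoring the low four flag bits. */
const char *
sketch_fdw_xact_identify(uint8_t info)
{
	switch (info & ~SKETCH_XLR_INFO_MASK)
	{
		case SKETCH_FDW_XACT_INSERT:
			return "NEW FOREIGN TRANSACTION";
		case SKETCH_FDW_XACT_REMOVE:
			return "REMOVE FOREIGN TRANSACTION";
	}
	/* Unknown record type */
	return NULL;
}
```

Under these assumed values, an info byte of 0x1F still identifies as a remove record, because only the high nibble takes part in the comparison.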
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..46307d7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..c9d5b4b
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2228 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2016, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using the function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * respective prepared transactions. If the first phase does not succeed because
+ * of any failure, the foreign servers are asked to roll back their respective
+ * prepared transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and to the logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		get_prepare_id;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (max_fdw_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier providing function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Switch back to the previous memory context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;		/* user mapping id for connection key */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;   /* XLOG offset of inserting this entry start */
+	XLogRecPtr		fdw_xact_end_lsn;   /* XLOG offset of inserting this entry end*/
+
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool            ondisk;             /* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+/*
+ * During replay and replication KnownFDWXactList holds info about active foreign server
+ * transactions that haven't been moved to files yet. We will need that info by the end of
+ * recovery (including promote) to restore the in-memory state of those transactions.
+ *
+ * The naive approach here would be to move each PREPARE record to disk, fsync it and not
+ * have that list at all, but that provokes a lot of unnecessary fsyncs on small files,
+ * causing the replica to be slower than the master.
+ *
+ * Replay of twophase records happens by the following rules:
+ *		* On PREPARE redo KnownFDWXactAdd() is called to add that transaction to
+ *		  KnownFDWXactList and no more actions are taken.
+ *		* On checkpoint redo we iterate through KnownFDWXactList, move all prepare
+ *		  records that are behind redo_horizon to files and delete them from the list.
+ *		* On COMMIT/ABORT we delete the file or the entry in KnownFDWXactList.
+ *		* At the end of recovery we move all known foreign server transactions to disk
+ *		  to allow RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *		  to do their work.
+ */
+typedef struct KnownFDWXact
+{
+	TransactionId	local_xid;
+	Oid				serverid;
+	Oid				userid;
+	XLogRecPtr		fdw_xact_start_lsn;
+	XLogRecPtr		fdw_xact_end_lsn;
+	dlist_node		list_node;
+} KnownFDWXact;
+
+static dlist_head KnownFDWXactList = DLIST_STATIC_INIT(KnownFDWXactList);
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * This is a GUC variable; changing it requires a restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Basically, the foreign transactions are prepared on the foreign servers that
+ * can execute the two-phase-commit protocol; those prepared transactions will
+ * be aborted or committed after the current transaction has been aborted or
+ * committed, respectively. However, if only one server that can execute the
+ * two-phase-commit protocol is involved in the transaction and no changes were
+ * made to local data, two-phase commit is unnecessary, so we simply try to
+ * commit the transaction on that server. We try to commit transactions on the
+ * rest of the foreign servers now. For these foreign servers it is possible
+ * that some transactions commit even if the local transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute the
+	 * two-phase-commit protocol have already committed, and MyFDWConnections
+	 * holds only foreign servers that can execute it. We don't need the
+	 * two-phase-commit protocol if there is only one such foreign server and
+	 * the transaction did not write on the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * We don't need the two-phase commit protocol when only one server
+		 * remains, even if this server can execute the two-phase-commit protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Only one connection is left; remove it so MyFDWConnections is empty */
+		MyFDWConnections = list_delete_first(MyFDWConnections);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_id;
+		int				fdw_xact_id_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+												 fdw_conn->userid,
+												 &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (thereby relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared a transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+		/*
+		 * Between the register_fdw_xact call above and the moment this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it will
+		 * send an ABORT message to the foreign server. If the foreign server has
+		 * not prepared the transaction, the message will succeed. If the foreign
+		 * server has prepared the transaction, it will throw an error, which we
+		 * will ignore, and the prepared foreign transaction will be resolved by the
+		 * foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the error
+			 * text from FDW and add it here in the message or as a context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; it is persisted to disk under the pg_fdw_xact directory at checkpoint time.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+			XLogRecPtr			recptr;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			START_CRIT_SECTION();
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+			XLogFlush(recptr);
+
+			END_CRIT_SECTION();
+
+			/* Remove the file from the disk if exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between those two steps, we will be left
+	 * with an entry locked by this backend, which gets unlocked only at the
+	 * server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/*
+	 * Lock and fetch all the entries belonging to the given transaction id. If
+	 * the foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point in time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
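+/*
+ * get_prepared_foreign_xact_resolver
+ *
+ * Look up the FDW routine that resolves prepared foreign transactions for the
+ * foreign server of the given entry. Raises an error if the FDW does not
+ * provide one.
+ */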
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. Check
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by the arguments
+ * that hold valid values for their respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | any entry (no filtering)
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every entry matches, so
+ * the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked by other
+ * backends, nothing will be returned in the list but the returned value will
+ * be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * get_dbids_with_unresolved_xact
+ * returns the oids of the databases containing unresolved foreign transactions.
+ * The function is used by pg_fdw_xact_resolver extension. Returns NIL if
+ * no such entry exists.
+ */
+List *
+get_dbids_with_unresolved_xact(void)
+{
+	int		cnt_xact;
+	List	*dbid_list = NIL;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+	for (cnt_xact = 0; cnt_xact < FDWXactGlobal->num_fdw_xacts; cnt_xact++)
+	{
+		FDWXact	fdw_xact;
+
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt_xact];
+
+		/* Skip locked entry as someone must be working on it */
+		if (fdw_xact->locking_backend == InvalidBackendId)
+			dbid_list = list_append_unique_oid(dbid_list, fdw_xact->dboid);
+	}
+	LWLockRelease(FDWXactLock);
+
+	return dbid_list;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		KnownFDWXactAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec        *fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		KnownFDWXactRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						   fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function writes to disk the foreign transaction entries whose WAL records
+ * precede the checkpoint's redo horizon and which do not yet have a state
+ * file. The foreign transaction entries and hence the corresponding files are
+ * expected to be very short-lived. By executing this function at the end, we
+ * might have fewer files to write and fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * The files are written while holding FDWXactLock; see the comment in the
+ * function body for the reasoning.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick get-away, if there are no entries to process */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want
+ * them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing more to be done here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is
+				 * not in progress on any of the backends. So probably, it
+				 * crashed before actual abort or commit. So assume it to be
+				 * aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* fd was already closed above, so there is nothing to close here */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	/*
+	 * Move foreign transactions from KnownFDWXactList to files, if any.
+	 * It is possible to skip that step and teach subsequent code about
+	 * KnownFDWXactList, but the whole PreScan() happens once during end of
+	 * recovery or promotion, so it probably isn't worth the complication.
+	 */
+	KnownFDWXactRecreateFiles(InvalidXLogRecPtr);
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* LSNs are not known for an entry recovered from a file */
+			fdw_xact->fdw_xact_start_lsn = 0;
+			fdw_xact->fdw_xact_end_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
+
+/*
+ * KnownFDWXactAdd
+ *
+ * Store correspondence of start/end lsn and xid in KnownFDWXactList.
+ * This is called during redo of a prepare record, to keep a list of
+ * transactions prepared on foreign servers that have not yet been moved to
+ * state files by the end of recovery.
+ */
+void
+KnownFDWXactAdd(XLogReaderState *record)
+{
+	KnownFDWXact *fdw_xact;
+	FDWXactOnDiskData *fdw_xact_data_file = (FDWXactOnDiskData *)XLogRecGetData(record);
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = (KnownFDWXact *) palloc(sizeof(KnownFDWXact));
+	fdw_xact->local_xid = fdw_xact_data_file->local_xid;
+	fdw_xact->serverid = fdw_xact_data_file->serverid;
+	fdw_xact->userid = fdw_xact_data_file->userid;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+
+	dlist_push_tail(&KnownFDWXactList, &fdw_xact->list_node);
+}
+
+/*
+ * KnownFDWXactRemove
+ *
+ * Forget about a foreign transaction. Called during commit/abort redo.
+ */
+void
+KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	dlist_mutable_iter miter;
+
+	Assert(RecoveryInProgress());
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact *fdw_xact = dlist_container(KnownFDWXact, list_node,
+												 miter.cur);
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			dlist_delete(miter.cur);
+			/*
+			 * Since we found the entry in KnownFDWXactList, we know that the
+			 * file isn't on disk yet, so we can stop here.
+			 */
+			return;
+		}
+	}
+
+	/*
+	 * Here we know that the file should be removed from disk. But aborting
+	 * recovery because an unneeded file is absent doesn't seem to be a good
+	 * idea, so call remove with giveWarning = false.
+	 */
+	RemoveFDWXactFile(xid, serverid, userid, false);
+}
+
+/*
+ * KnownFDWXactRecreateFiles
+ *
+ * Moves foreign server transaction records from WAL to files. Called during
+ * checkpoint replay or PrescanFDWXacts.
+ *
+ * redo_horizon = InvalidXLogRecPtr indicates that all transactions from
+ *		KnownFDWXactList should be moved to disk.
+ */
+void
+KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon)
+{
+	dlist_mutable_iter miter;
+	int			serialized_fdw_xacts = 0;
+
+	Assert(RecoveryInProgress());
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact   *fdw_xact = dlist_container(KnownFDWXact,
+														list_node, miter.cur);
+
+		if (fdw_xact->fdw_xact_end_lsn <= redo_horizon || redo_horizon == InvalidXLogRecPtr)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			pfree(buf);
+			dlist_delete(miter.cur);
+			serialized_fdw_xacts++;
+		}
+	}
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..ad71c0e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5b72c1d..9e883a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f5346f0..4de49f5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -114,6 +115,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -187,6 +190,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in
+										   the transaction; only valid for top level */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1918,6 +1925,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1978,6 +1988,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2135,6 +2148,7 @@ CommitTransaction(void)
 	AtEOXact_HashTables(true);
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2156,6 +2170,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2222,6 +2238,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2427,6 +2446,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2608,9 +2629,12 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4294,6 +4318,32 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	/* Quick exits if no need to remember */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	/* Quick exits if no need to forget */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70edafa..8e70fc7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4935,6 +4936,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -6002,6 +6004,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6688,7 +6693,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7304,6 +7312,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7490,6 +7499,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8797,6 +8809,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9234,7 +9251,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9255,6 +9273,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9270,6 +9289,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9458,6 +9478,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9500,6 +9521,7 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
+		KnownFDWXactRecreateFiles(checkPoint.redo);
 		RecoveryRestartPoint(&checkPoint);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
@@ -9650,6 +9672,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6511c60..15cad78 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 31aade1..00ca32c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -277,6 +277,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 476a023..a975376 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1087,6 +1088,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1385,6 +1400,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4692427..f37ff7d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -435,6 +435,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -699,6 +702,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -996,6 +1002,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 00f5ae9..e53dfb7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -148,6 +149,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -267,6 +269,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index dd04182..0285f5b 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -48,3 +48,4 @@ ReplicationOriginLock				40
 MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
 BackendRandomLock					43
+FDWXactLock					44
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4e2bd4c..8798625 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -2051,6 +2052,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 15669b8..ea4308d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 1aaadc1..2ff5768 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1e7d677..412e506 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -204,6 +204,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3370966 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -301,5 +301,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:   %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 963802e..42c9942 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..d6ff550 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..a556280
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,79 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void KnownFDWXactAdd(XLogReaderState *record);
+extern void KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+/* For the sake of foreign transaction resolver */
+extern List	*get_dbids_with_unresolved_xact(void);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 5f76749..db28498 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,6 +44,7 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 4df6529..a969696 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
@@ -356,6 +359,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index e0fcd05..2d2b117 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 23731e9..3920cce 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 37e022d..cb4fb10 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5262,6 +5262,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4111 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..565aa1b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 398fa8a..4b9c9af 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -254,11 +254,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 4 slots.
+ * launched only after startup has exited, so we only need 5 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index e1bb344..ff71b42 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -1332,4 +1332,8 @@ extern Datum pg_prepared_statement(PG_FUNCTION_ARGS);
 /* utils/mmgr/portalmem.c */
 extern Datum pg_cursor(PG_FUNCTION_ARGS);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e9cfadb..02d8e5d 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index d4d00d9..a1086d4 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions.  (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2273,7 +2275,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

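For readers following the API additions in the patch above, the call order
implied by the new FdwRoutine callbacks can be sketched as below. This is only
an illustration under assumptions: MockServer, mock_prepare, mock_resolve and
two_phase_commit are hypothetical names that do not appear in the patch, and
the real callbacks take server, user and user-mapping OIDs rather than a
struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction */
typedef enum { SRV_OPEN, SRV_PREPARED, SRV_COMMITTED, SRV_ABORTED } SrvState;

typedef struct
{
	bool		prepare_ok;		/* does the simulated PREPARE succeed? */
	SrvState	state;
} MockServer;

/* Phase 1: roughly the role of PrepareForeignTransaction */
static bool
mock_prepare(MockServer *srv)
{
	assert(srv->state == SRV_OPEN);
	if (!srv->prepare_ok)
		return false;
	srv->state = SRV_PREPARED;
	return true;
}

/*
 * Phase 2: roughly the role of ResolvePreparedForeignTransaction for
 * prepared servers, or a plain rollback for servers never prepared.
 */
static void
mock_resolve(MockServer *srv, bool is_commit)
{
	/* Only a prepared server may be committed */
	assert(!is_commit || srv->state == SRV_PREPARED);
	srv->state = is_commit ? SRV_COMMITTED : SRV_ABORTED;
}

/* Prepare everywhere first; commit only if every prepare succeeded */
static bool
two_phase_commit(MockServer *servers, size_t n)
{
	size_t		i;
	bool		all_prepared = true;

	for (i = 0; i < n; i++)
	{
		if (!mock_prepare(&servers[i]))
		{
			all_prepared = false;
			break;
		}
	}

	for (i = 0; i < n; i++)
		mock_resolve(&servers[i], all_prepared);

	return all_prepared;
}
```

The sketch leaves out the failure mode discussed in the thread, where a
connection is lost after a successful prepare and the remote transaction stays
prepared until someone resolves it.
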
Attachment: 001_pgfdw_support_atomic_commit_v4.patch (binary/octet-stream)

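The postgres_fdw patch below crafts the prepared transaction id from a fixed
prefix, a random number, and the server and user OIDs. A simplified sketch of
that idea follows. build_prep_id is a hypothetical helper, not a function from
the patch, and the exact format string is an assumption chosen for
readability. It also checks for truncation so that an over-long id is reported
rather than silently cut.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same limit the patch borrows from twophase.c */
#define PREP_XACT_ID_MAX_LEN 200

/*
 * Hypothetical helper: write "px_<rnd>_<serverid>_<userid>" into buf and
 * return its length (excluding the terminating NUL), or -1 if it does not
 * fit in buflen bytes.
 */
static int
build_prep_id(char *buf, size_t buflen, int rnd,
			  unsigned int serverid, unsigned int userid)
{
	int			len;

	assert(buf != NULL && buflen > 0);

	/* OIDs are unsigned, so print them with %u */
	len = snprintf(buf, buflen, "px_%d_%u_%u", rnd, serverid, userid);
	if (len < 0 || (size_t) len >= buflen)
		return -1;
	return len;
}
```

With rnd = 42, serverid = 16384 and userid = 10 the helper produces
"px_42_16384_10". Embedding the OIDs lets someone reading the remote server's
pg_prepared_xacts see which server and user mapping a leftover transaction
belonged to, though the OIDs are only meaningful on the originating server.
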
diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..cc2b2c6 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * If connection_error_ok is true, the caller handles a connection failure
+ * itself and NULL is returned; if false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -235,11 +262,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"", server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
@@ -370,15 +400,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the foreign server with the transaction manager, noting
+		 * whether it is configured to use two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -586,158 +623,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * Craft a prepared transaction identifier. The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It must be a string literal less than 200 bytes long.
+ * 2. It must not match the id of any other currently prepared transaction.
+ *
+ * Ideally the id would be something like a UUID, which is unique with high
+ * probability, but generating one may be expensive here, and the UUID
+ * extension that provides the generator function is not part of the
+ * core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%u_%u",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * Prepare the transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * Commit or abort a prepared transaction on the foreign server.
+ * This function may be called when we have no connection to the foreign
+ * server involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed; raise a warning to log the reason for the failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -836,3 +999,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0045f3f..beb072f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7059,3 +7087,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 552b333..1795a76 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -107,7 +107,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -176,6 +177,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 64f857f..6a1a5c2 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -465,6 +467,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1321,7 +1329,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1698,7 +1706,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2293,7 +2301,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2555,7 +2563,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3492,7 +3500,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3582,7 +3590,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3805,7 +3813,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..ff57e98 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 9191776..49b6262 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1683,3 +1713,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
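
The regression cases above all check one property: either every participating server keeps the transaction's rows, or none does. As a rough illustration of that all-or-nothing flow (prepare everywhere, then commit only if every prepare succeeded), here is a minimal standalone sketch. It is not code from the patch. `SketchServer`, `sketch_prepare`, `sketch_resolve` and `sketch_atomic_commit` are hypothetical names, and remote servers are simulated with an in-memory state field instead of libpq calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical states of a transaction on one simulated foreign server. */
typedef enum
{
	SK_ACTIVE,
	SK_PREPARED,
	SK_COMMITTED,
	SK_ABORTED
} SketchState;

/* Hypothetical stand-in for one foreign server; not a type from the patch. */
typedef struct
{
	bool		prepare_will_fail;	/* simulate PREPARE TRANSACTION failing */
	SketchState state;
} SketchServer;

/* Phase 1: try to prepare on one server; returns false on failure. */
static bool
sketch_prepare(SketchServer *s)
{
	assert(s->state == SK_ACTIVE);
	if (s->prepare_will_fail)
	{
		/* a failed prepare leaves the remote transaction aborted */
		s->state = SK_ABORTED;
		return false;
	}
	s->state = SK_PREPARED;
	return true;
}

/*
 * Phase 2: commit or roll back a prepared transaction.  If nothing is
 * prepared on the server, report it as already resolved.
 */
static bool
sketch_resolve(SketchServer *s, bool is_commit)
{
	if (s->state != SK_PREPARED)
		return true;
	s->state = is_commit ? SK_COMMITTED : SK_ABORTED;
	return true;
}

/* Returns true if all servers committed, false if all were rolled back. */
static bool
sketch_atomic_commit(SketchServer *servers, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (!sketch_prepare(&servers[i]))
		{
			all_prepared = false;
			break;
		}
	}

	for (i = 0; i < n; i++)
	{
		if (servers[i].state == SK_ACTIVE)
			servers[i].state = SK_ABORTED;	/* never prepared: plain abort */
		else
			(void) sketch_resolve(&servers[i], all_prepared);
	}
	return all_prepared;
}
```

A caller would fill an array of `SketchServer` entries in the `SK_ACTIVE` state and call `sketch_atomic_commit()`. The sketch leaves out the hard part, which is losing a connection between the two phases.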

002_pg_fdw_resolver_contrib_v4.patch

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..100f8fe
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,365 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How frequently (in milliseconds) the resolver daemon checks for unresolved transactions */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList
+	 * and hence notification to this backend is not enabled. So set that flag
+	 * even if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL;
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need a handler
+	 * only for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}

003_regression_test_for_fdw_xact_v4.patch (binary/octet-stream)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..b3413ce 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -19,4 +19,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/009_fdw_xact.pl b/src/test/recovery/t/009_fdw_xact.pl
new file mode 100644
index 0000000..79711bc
--- /dev/null
+++ b/src/test/recovery/t/009_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transactions involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Set up master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check that we can commit and roll back transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check that we can commit and roll back transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit a transaction involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact entries exists only in WAL.
+# Check that the fdw_xact entries are processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after the foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction is processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");

#109

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

almost 9 years ago

In reply to: Masahiko Sawada (#108)

Re: Transactions involving multiple postgres foreign servers

On 2017/01/16 17:35, Masahiko Sawada wrote:

On Fri, Jan 13, 2017 at 3:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 13, 2017 at 3:20 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

A long time has passed since Ashutosh proposed the original patch, so let
me explain the current design and functionality of this feature again.
If you have any questions, please feel free to ask.

Thanks for the summary.

Parameters
==========

[ snip ]

Cluster-wide atomic commit
=======================
Because the commit on each foreign server is executed independently, a
transaction that modifies data on multiple foreign servers is not
guaranteed to either commit on all of them or roll back on all of them.
The patch adds functionality that guarantees a distributed transaction
either commits or rolls back on all foreign servers. IOW the goal of this
patch is to achieve cluster-wide atomic commit across foreign servers that
are capable of the two-phase commit protocol.
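
For readers skimming the thread, the all-or-nothing behaviour described above can be sketched as a toy, in-memory two-phase commit. This is only an illustration under stated assumptions, not code from the patch: the Participant struct, the can_prepare flag and two_phase_commit() are hypothetical names, and no real foreign server is contacted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical participant states; these are not taken from the patch. */
typedef enum
{
    PS_ACTIVE,
    PS_PREPARED,
    PS_COMMITTED,
    PS_ABORTED
} ParticipantState;

typedef struct
{
    const char *name;           /* e.g. "fs1" */
    bool        can_prepare;    /* simulates whether the prepare step succeeds */
    ParticipantState state;
} Participant;

/*
 * Phase 1: try to prepare on every participant, stopping at the first
 * failure.  Phase 2: commit everywhere if every prepare succeeded,
 * otherwise roll back everywhere.
 *
 * Returns true if the distributed transaction committed.
 */
bool
two_phase_commit(Participant *p, size_t n)
{
    size_t      i;
    bool        all_prepared = true;

    assert(p != NULL || n == 0);

    for (i = 0; i < n; i++)
    {
        if (!p[i].can_prepare)
        {
            all_prepared = false;
            break;
        }
        p[i].state = PS_PREPARED;
    }

    for (i = 0; i < n; i++)
        p[i].state = all_prepared ? PS_COMMITTED : PS_ABORTED;

    return all_prepared;
}
```

With two participants that can both prepare, the function returns true and both end up PS_COMMITTED; if either one cannot prepare, it returns false and both end up PS_ABORTED. What the sketch leaves out is the hard part discussed in this thread: a participant that becomes unreachable after it has prepared.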

In [1], I proposed that we solve the problem of supporting PREPARED
transactions involving foreign servers, and in a subsequent mail Vinayak
agreed to that. But this goal has a wider scope than that proposal. I am
fine with widening the scope, but then it would again lead to the same
discussion we had about the big picture. Maybe you want to share the
design (or point out the parts of this design that will help) for
solving the smaller problem and tone down the patch accordingly.

Sorry for confusing you. I'm still focusing on solving only that
problem. What I was trying to say is that supporting PREPARED
transactions involving foreign servers is the means, not the end. So once
we support PREPARED transactions involving foreign servers, we can
achieve cluster-wide atomic commit in a sense.

Attached are updated patches. I fixed some bugs and added a 003 patch
that adds a TAP test for foreign transactions.
The 003 patch depends on the 000 and 001 patches.

Please give me feedback.

I have tested prepared transactions with foreign servers, but after
preparing the transaction the following error occurs repeatedly, without
stopping.
Test:
=====
=#BEGIN;
=#INSERT INTO ft1_lt VALUES (10);
=#INSERT INTO ft2_lt VALUES (20);
=#PREPARE TRANSACTION 'prep_xact_with_fdw';

2017-01-18 15:09:48.378 JST [4312] ERROR: function pg_fdw_resolve() does not exist at character 8
2017-01-18 15:09:48.378 JST [4312] HINT: No function matches the given name and argument types. You might need to add explicit type casts.
2017-01-18 15:09:48.378 JST [4312] QUERY: SELECT pg_fdw_resolve()
2017-01-18 15:09:48.378 JST [29224] LOG: worker process: foreign transaction resolver (dbid 13119) (PID 4312) exited with exit code 1
.....

If we check the status from another session, it shows the status as
prepared.
=# select * from pg_fdw_xacts;
 dbid  | transaction | serverid | userid |  status  |       identifier
-------+-------------+----------+--------+----------+------------------------
 13119 |        1688 |    16388 |     10 | prepared | px_2102366504_16388_10
 13119 |        1688 |    16389 |     10 | prepared | px_749056984_16389_10
(2 rows)

I think this is a bug.
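
Judging only from the two sample rows above, the identifier appears to have the shape px_<number>_<serverid>_<userid>: 16388 and 10 in the first identifier are also the serverid and userid values reported by pg_fdw_xacts for that row. A minimal sketch of building and decoding such a string follows. The function names and the exact format string are assumptions made for illustration; the real logic lives in the patch and may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: build "px_<number>_<serverid>_<userid>".
 * Returns what snprintf() returns, i.e. the length the full string needs.
 */
int
format_prep_xact_id(char *buf, size_t buflen, long random_part,
                    unsigned int serverid, unsigned int userid)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "px_%ld_%u_%u",
                    random_part, serverid, userid);
}

/*
 * Hypothetical helper: recover serverid and userid from an identifier of
 * the assumed shape.  Returns 1 on success, 0 if the string does not match.
 */
int
parse_prep_xact_id(const char *id, unsigned int *serverid,
                   unsigned int *userid)
{
    long        random_part;

    if (strncmp(id, "px_", 3) != 0)
        return 0;
    return sscanf(id + 3, "%ld_%u_%u",
                  &random_part, serverid, userid) == 3;
}
```

For the first sample row, format_prep_xact_id(buf, sizeof(buf), 2102366504L, 16388, 10) produces px_2102366504_16388_10. Being able to decode the server and user from the string is what lets a DBA trace a dangling prepared transaction on a foreign server back to its origin.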

Regards,
Vinayak Pokale
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#110

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: vinayak (#109)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 19, 2017 at 4:04 PM, vinayak
<Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

I have tested prepared transactions with foreign servers, but after
preparing the transaction the following error occurs repeatedly, without
stopping.
Test:
=====
=#BEGIN;
=#INSERT INTO ft1_lt VALUES (10);
=#INSERT INTO ft2_lt VALUES (20);
=#PREPARE TRANSACTION 'prep_xact_with_fdw';

2017-01-18 15:09:48.378 JST [4312] ERROR: function pg_fdw_resolve() does
not exist at character 8
2017-01-18 15:09:48.378 JST [4312] HINT: No function matches the given name
and argument types. You might need to add explicit type casts.
2017-01-18 15:09:48.378 JST [4312] QUERY: SELECT pg_fdw_resolve()
2017-01-18 15:09:48.378 JST [29224] LOG: worker process: foreign
transaction resolver (dbid 13119) (PID 4312) exited with exit code 1
.....

If we check the status from another session, it shows the status as
prepared.
=# select * from pg_fdw_xacts;
dbid | transaction | serverid | userid | status | identifier
-------+-------------+----------+--------+----------+------------------------
13119 | 1688 | 16388 | 10 | prepared | px_2102366504_16388_10
13119 | 1688 | 16389 | 10 | prepared | px_749056984_16389_10
(2 rows)

I think this is a bug.

Thank you for reviewing!

I think this is a bug in the pg_fdw_xact_resolver contrib module. I had
forgotten to change the SQL executed by the resolver process.
Attached is the latest version of the 002 patch.
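
To see why the error in the report repeats, note that the launcher in the 002 patch starts a worker whenever none is running and unresolved foreign transactions remain, and it rechecks at least every FDW_XACT_RESOLVE_NAP_TIME (10 seconds). A worker that fails on a missing function resolves nothing, so the next check starts another one. The toy simulation below is a sketch under that reading, not code from the patch; simulate_launcher() and the two resolve_* callbacks are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one worker run; returns entries resolved. */
typedef int (*ResolveFn) (int unresolved);

/* Worker whose SQL names a function that does not exist: resolves nothing. */
int
resolve_missing_function(int unresolved)
{
    (void) unresolved;
    return 0;
}

/* Worker whose SQL names the right function: resolves everything. */
int
resolve_ok(int unresolved)
{
    return unresolved;
}

/*
 * Simulate the launcher: on each wake-up, start one worker if unresolved
 * entries remain.  Returns the number of workers started within max_cycles.
 */
int
simulate_launcher(int unresolved, int max_cycles, ResolveFn run_worker)
{
    int         launches = 0;
    int         cycle;

    assert(unresolved >= 0 && max_cycles >= 0 && run_worker != NULL);

    for (cycle = 0; cycle < max_cycles && unresolved > 0; cycle++)
    {
        launches++;
        unresolved -= run_worker(unresolved);
    }
    return launches;
}
```

With two unresolved entries and five wake-ups, the failing worker is started five times, once per wake-up, while the working one is started once. That matches the endlessly repeating log lines in the report, where the two pg_fdw_xacts rows stay in the prepared state.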

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

002_pg_fdw_resolver_contrib_v5.patch (binary/octet-stream)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..def8752
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,365 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions.
+ *
+ * Copyright (C) 2016, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+
+/* How frequently the resolver daemon checks for unresolved transactions (in ms) */
+#define FDW_XACT_RESOLVE_NAP_TIME (10 * 1000L)
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 0;	/* restart immediately */
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		List	*dbid_list = NIL;
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle)
+		{
+			/*
+			 * If we do not know which databases have foreign servers with
+			 * unresolved foreign transactions, get the list.
+			 */
+			if (!dbid_list)
+				dbid_list = get_dbids_with_unresolved_xact();
+
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				Oid dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_first(dbid_list);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+		}
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   FDW_XACT_RESOLVE_NAP_TIME,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * In case of a SIGHUP, just reload the configuration.
+		 */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT pg_fdw_xact_resolve()";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need a handler
+	 * only for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN);
+	pqsignal(SIGUSR2, SIG_IGN);
+
+	/* Reset some signals that are accepted by postmaster but not here */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}

#111

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#110)

4 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 19, 2017 at 5:44 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for reviewing!

I think this is a bug in the pg_fdw_xact_resolver contrib module. I had
forgotten to change the SQL executed by the resolver process.
Attached is the latest version of the 002 patch.

As the previous version of the patches conflicts with current HEAD, I have
attached updated versions. I also fixed some bugs in pg_fdw_xact_resolver
and added some documentation.
Please review them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_support_fdw_xact_v6.patch (text/x-patch; charset=US-ASCII)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index fb5d647..27ed724 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1431,6 +1431,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0c1db07..a5ddbca 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1700,5 +1700,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to the foreign server while starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</> using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is more than zero
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the foreign servers registered should
+    support the two-phase commit protocol. The two-phase commit protocol is used
+    to achieve an atomic distributed transaction when two or more foreign servers
+    that support the two-phase commit protocol are involved in the transaction, or
+    when the transaction involves one foreign server that supports the two-phase
+    commit protocol and also changes local data. In other cases, for example where
+    only one foreign server that supports two-phase commit is involved in the
+    transaction, the two-phase commit protocol is not used. The two-phase commit
+    protocol is processed in two phases: prepare phase and commit phase. In prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, prepare phase fails. In commit
+    phase, all the prepared transactions are committed if prepare phase has succeeded
+    or rolled back if prepare phase fails to prepare transactions on all the foreign
+    servers.
+    </para>
+
+    <para>
+    During prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier
+    with action FDW_XACT_RESOLVED.
+    </para>
+    
+    <para>
+    During commit phase the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to rollback the prepared transaction. In case the
+    distributed transaction manager fails to commit or rollback a prepared
+    transaction because of connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit can
+    not be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while the same
+    on other foreign servers would be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..5c35bd1
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..46307d7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..ed6dcc6
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2200 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign server/s.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the oids of the foreign server and user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers, which can participate
+ * in two-phase commit protocol. Transaction on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit respective
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to rollback respective prepared
+ * transactions or abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction leaves
+ * that prepared transaction unresolved. During the first phase, before actually
+ * preparing the transactions, enough information is persisted to the disk and
+ * logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		get_prepare_id;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (max_fdw_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier providing function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;		/* user mapping id for connection key */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;   /* XLOG offset of inserting this entry start */
+	XLogRecPtr		fdw_xact_end_lsn;   /* XLOG offset of inserting this entry end*/
+
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool            ondisk;             /* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+/*
+ * During replay and replication KnownFDWXactList holds info about active foreign server
+ * transactions that weren't moved to files yet. We will need that info by the end of
+ * recovery (including promote) to restore the memory state of those transactions.
+ *
+ * A naive approach here is to move each PREPARE record to disk, fsync it and not have
+ * that list at all, but that provokes a lot of unnecessary fsyncs on small files,
+ * causing the replica to be slower than the master.
+ *
+ * Replay of twophase records happens by the following rules:
+ *		* On PREPARE redo KnownFDWXactAdd() is called to add that transaction to
+ *		  KnownFDWXactList and no more actions are taken.
+ *		* On checkpoint redo we iterate through KnownFDWXactList and move all prepare
+ *		  records that are behind redo_horizon to files, deleting them from the list.
+ *		* On COMMIT/ABORT we delete the file or the entry in KnownFDWXactList.
+ *		* At the end of recovery we move all known foreign server transactions to disk
+ *		  to allow RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *		  to do their work.
+ */
+typedef struct KnownFDWXact
+{
+	TransactionId	local_xid;
+	Oid				serverid;
+	Oid				userid;
+	XLogRecPtr		fdw_xact_start_lsn;
+	XLogRecPtr		fdw_xact_end_lsn;
+	dlist_node		list_node;
+} KnownFDWXact;
+
+static dlist_head KnownFDWXactList = DLIST_STATIC_INIT(KnownFDWXactList);
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Basically the foreign transactions are prepared on the foreign servers which
+ * can execute two-phase-commit protocol. But in the case where only one server
+ * that can execute two-phase-commit protocol is involved in the transaction and
+ * no changes are made to local data, we don't need the two-phase-commit protocol,
+ * so we try to commit the transaction on that server. Prepared transactions will be
+ * aborted or committed after the current transaction has been aborted or committed
+ * resp. We try to commit transactions on the rest of the foreign servers now. For
+ * these foreign servers it is possible that some transactions commit even if the
+ * local transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no more part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Here foreign servers that can not execute two-phase-commit protocol
+	 * have already committed the transaction and MyFDWConnections has only foreign
+	 * servers that can execute two-phase-commit protocol. We don't need to use
+	 * two-phase-commit protocol if there is only one foreign server that
+	 * can execute two-phase-commit and we didn't write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * We don't need to use two-phase commit protocol when only one server
+		 * remains, even if this server can execute two-phase-commit protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* MyFDWConnections should be cleared here */
+		MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_id;
+		int				fdw_xact_id_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+												 fdw_conn->userid,
+												 &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs (that way relaying it to the standby). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+		/*
+		 * Between register_fdw_xact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction (say,
+		 * because of a signal). During abort processing, it will send an ABORT
+		 * message to the foreign server. If the foreign server has not prepared
+		 * the transaction, the message will succeed. If the foreign server has
+		 * prepared transaction, it will throw an error, which we will ignore and the
+		 * prepared foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the error
+			 * text from FDW and add it here in the message or as a context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; the entry is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add an xlog entry. The contents
+	 * of the xlog record are the same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+			XLogRecPtr			recptr;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			START_CRIT_SECTION();
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+			XLogFlush(recptr);
+
+			END_CRIT_SECTION();
+
+			/* Remove the file from the disk if exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reversed the
+	 * order and the process exited in between, we would be left with an entry
+	 * locked by this backend, which would get unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED) ?
+							true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. See
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by the arguments
+ * that hold valid values for their respective datatypes.
+ *
+ * The table below lists the combinations:
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), the function returns
+ * true if any entry exists, since any entry would match the criteria.
+ *
+ * If qualifying_fdw_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked is ignored. If
+ * all the qualifying entries are locked, nothing will be returned in the list
+ * but returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		KnownFDWXactAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec        *fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		KnownFDWXactRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						   fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * For simplicity, the function currently performs the file I/O while holding
+ * FDWXactLock, so an entry cannot be removed while its file is being written.
+ * Collecting the entries under the lock and writing the files after releasing
+ * it would avoid disk I/O under a lightweight lock, but would create a race:
+ * the entry, and hence the file, might be removed in between. See the comment
+ * in the function body.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick check, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXacts that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* XXX: should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * A user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from which it is run.
+ * The function returns the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing more needs to be decided here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction neither committed nor aborted, and it is not
+				 * running on any backend. It probably crashed before the actual
+				 * abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has an invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* The file descriptor was already closed above. */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	/*
+	 * Move foreign transactions from KnownFDWXactList to files, if any.
+	 * It is possible to skip that step and teach subsequent code about
+	 * KnownFDWXactList, but the whole prescan happens only once, at the end
+	 * of recovery or on promotion, so it probably isn't worth the complication.
+	 */
+	KnownFDWXactRecreateFiles(InvalidXLogRecPtr);
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* Entries restored from files have no known WAL location */
+			fdw_xact->fdw_xact_start_lsn = 0;
+			fdw_xact->fdw_xact_end_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
+
+/*
+ * KnownFDWXactAdd
+ *
+ * Store the correspondence between start/end LSN and xid in KnownFDWXactList.
+ * This is called during redo of a prepare record, to keep a list of the
+ * transactions prepared on foreign servers that have not yet been moved to
+ * files by the end of recovery.
+ */
+void
+KnownFDWXactAdd(XLogReaderState *record)
+{
+	KnownFDWXact *fdw_xact;
+	FDWXactOnDiskData *fdw_xact_data_file = (FDWXactOnDiskData *)XLogRecGetData(record);
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = (KnownFDWXact *) palloc(sizeof(KnownFDWXact));
+	fdw_xact->local_xid = fdw_xact_data_file->local_xid;
+	fdw_xact->serverid = fdw_xact_data_file->serverid;
+	fdw_xact->userid = fdw_xact_data_file->userid;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+
+	dlist_push_tail(&KnownFDWXactList, &fdw_xact->list_node);
+}
+
+/*
+ * KnownFDWXactRemove
+ *
+ * Forget about a foreign transaction. Called during commit/abort redo.
+ */
+void
+KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	dlist_mutable_iter miter;
+
+	Assert(RecoveryInProgress());
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact *fdw_xact = dlist_container(KnownFDWXact, list_node,
+												 miter.cur);
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			dlist_delete(miter.cur);
+			/*
+			 * Since we found the entry in KnownFDWXactList, we know that the
+			 * file isn't on disk yet, so we are done here.
+			 */
+			return;
+		}
+	}
+
+	/*
+	 * Here we know that the file should be removed from disk. But aborting
+	 * recovery because an unnecessary file is absent doesn't seem to be a
+	 * good idea, so call remove with giveWarning = false.
+	 */
+	RemoveFDWXactFile(xid, serverid, userid, false);
+}
+
+/*
+ * KnownFDWXactRecreateFiles
+ *
+ * Moves foreign server transaction records from WAL to files. Called during
+ * checkpoint replay or PrescanPreparedTransactions.
+ *
+ * redo_horizon = InvalidXLogRecPtr indicates that all transactions from
+ *		KnownFDWXactList should be moved to disk.
+ */
+void
+KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon)
+{
+	dlist_mutable_iter miter;
+	int			serialized_fdw_xacts = 0;
+
+	Assert(RecoveryInProgress());
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact   *fdw_xact = dlist_container(KnownFDWXact,
+														list_node, miter.cur);
+
+		if (fdw_xact->fdw_xact_end_lsn <= redo_horizon || redo_horizon == InvalidXLogRecPtr)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			pfree(buf);
+			dlist_delete(miter.cur);
+			serialized_fdw_xacts++;
+		}
+	}
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9bb1362..ad71c0e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 5b72c1d..9e883a3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f6f136d..b30b0ea 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -115,6 +116,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* Has this transaction written on the local node? */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -188,6 +191,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction;
+										   only valid for the top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1919,6 +1926,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1979,6 +1989,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2137,6 +2150,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
 	AtCommit_ApplyLauncher();
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2158,6 +2172,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2224,6 +2240,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2429,6 +2448,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2610,9 +2631,12 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4296,6 +4320,32 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	/* Quick exit if there is no need to remember */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	/* Quick exit if there is no need to forget */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2f5d603..a7eb92f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -4961,6 +4962,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -6028,6 +6030,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6714,7 +6719,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7330,6 +7338,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7516,6 +7525,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8823,6 +8835,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9260,7 +9277,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9281,6 +9299,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9296,6 +9315,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9484,6 +9504,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9526,6 +9547,7 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
+		KnownFDWXactRecreateFiles(checkPoint.redo);
 		RecoveryRestartPoint(&checkPoint);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
@@ -9676,6 +9698,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6511c60..15cad78 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 4dfedf8..02def2f 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -286,6 +286,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 476a023..a975376 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1087,6 +1088,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	simple_heap_delete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1385,6 +1400,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might leave a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e356039..04db1c2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -436,6 +436,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -697,6 +700,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -994,6 +1000,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index c95ca5b..57cba91 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepLauncherLock				44
 LogicalRepWorkerLock				45
+FDWXactLock					46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 74ca4e7..a6d28c2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/transam.h"
 #include "access/twophase.h"
@@ -2053,6 +2054,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 661b0fa..da979c5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 1aaadc1..2ff5768 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 443c2ee..19094d6 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -205,6 +205,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3370966 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -301,5 +301,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:   %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c
index 963802e..42c9942 100644
--- a/src/bin/pg_resetxlog/pg_resetxlog.c
+++ b/src/bin/pg_resetxlog/pg_resetxlog.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_xlogdump/rmgrdesc.c b/src/bin/pg_xlogdump/rmgrdesc.c
index 8fe20ce..d6ff550 100644
--- a/src/bin/pg_xlogdump/rmgrdesc.c
+++ b/src/bin/pg_xlogdump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/gin.h"
 #include "access/gist_private.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..d326ac1
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void KnownFDWXactAdd(XLogReaderState *record);
+extern void KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 5f76749..db28498 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -44,6 +44,7 @@ PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL)
 PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 4df6529..a969696 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
@@ -356,6 +359,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 8ad4d47..17275e9 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 23731e9..3920cce 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index c1f492b..7b83bcb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5269,6 +5269,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4111 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..565aa1b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 398fa8a..4b9c9af 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -254,11 +254,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 5bdca82..750054c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -120,4 +120,8 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 60abcad..fafb58e 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index d4d00d9..a1086d4 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions.  (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2273,7 +2275,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

Attachment: 001_pgfdw_support_atomic_commit_v6.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..cc2b2c6 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok, if true, indicates that the caller can handle a
+ * connection error by itself. If false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -235,11 +262,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"", server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
@@ -370,15 +400,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and record whether it can take
+		 * part in two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -586,158 +623,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * This function crafts a prepared transaction identifier. The PostgreSQL
+ * documentation places two restrictions on the identifier:
+ * 1. It must be a string literal less than 200 bytes long.
+ * 2. It must not be the same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * a UUID, which gives unique ids with high probability, but that may be
+ * expensive here, and the UUID extension which provides the function to
+ * generate UUIDs is not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * The function prepares the transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * The function commits or aborts a prepared transaction on the foreign server.
+ * It may be called when we do not have any connection to the foreign server
+ * involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason for the failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -836,3 +999,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 3a09280..dfef888 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7053,3 +7081,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index e24db56..c048c0d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ce1f443..7dce102 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -465,6 +467,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1321,7 +1329,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1698,7 +1706,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2293,7 +2301,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2555,7 +2563,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3492,7 +3500,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3582,7 +3590,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3805,7 +3813,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..ff57e98 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index e19a3ef..e913414 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1683,3 +1713,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index b31f373..21abe78 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -426,6 +426,42 @@
     foreign tables, see <xref linkend="sql-createforeigntable">.
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted
+    independently, so some remote transactions may fail to commit while
+    others commit successfully. This behavior may be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> uses
+       two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
  </sect2>
 
  <sect2>

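A note on the resolution path in the patch above: the resolve callback sends COMMIT PREPARED or ROLLBACK PREPARED and, if that fails, treats SQLSTATE 42704 (undefined_object) as "already resolved", while any other failure, including a missing SQLSTATE, leaves the transaction unresolved. The following is a minimal standalone sketch of that decision, not code from the patch. The helper names build_resolve_command and failure_counts_as_resolved are hypothetical, the macros reproduce the six-bit SQLSTATE packing from PostgreSQL's utils/elog.h so the example compiles without the server headers, and the length check on the SQLSTATE string is an added assumption (the patch indexes the five characters directly).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same packing as PostgreSQL's elog.h: five 6-bit characters in one int. */
#define PGSIXBIT(ch) (((ch) - '0') & 0x3F)
#define MAKE_SQLSTATE(c1, c2, c3, c4, c5) \
	(PGSIXBIT(c1) + (PGSIXBIT(c2) << 6) + (PGSIXBIT(c3) << 12) + \
	 (PGSIXBIT(c4) << 18) + (PGSIXBIT(c5) << 24))

#define ERRCODE_UNDEFINED_OBJECT   MAKE_SQLSTATE('4', '2', '7', '0', '4')
#define ERRCODE_CONNECTION_FAILURE MAKE_SQLSTATE('0', '8', '0', '0', '6')

/*
 * Hypothetical helper: format the resolution command into buf, using only
 * the first prep_info_len bytes of prep_info as the prepared-transaction id.
 * Returns what snprintf returns (the length the full command needs).
 */
int
build_resolve_command(char *buf, size_t buflen, bool is_commit,
					  int prep_info_len, const char *prep_info)
{
	assert(buf != NULL && prep_info != NULL);
	return snprintf(buf, buflen, "%s PREPARED '%.*s'",
					is_commit ? "COMMIT" : "ROLLBACK",
					prep_info_len, prep_info);
}

/*
 * Hypothetical helper: given the SQLSTATE string reported for a failed
 * resolution command (NULL if none was reported), decide whether the
 * prepared transaction should still be considered resolved.
 */
bool
failure_counts_as_resolved(const char *diag_sqlstate)
{
	int			sqlstate;

	/* Assumption: a missing or malformed SQLSTATE means the connection failed. */
	if (diag_sqlstate != NULL && strlen(diag_sqlstate) == 5)
		sqlstate = MAKE_SQLSTATE(diag_sqlstate[0], diag_sqlstate[1],
								 diag_sqlstate[2], diag_sqlstate[3],
								 diag_sqlstate[4]);
	else
		sqlstate = ERRCODE_CONNECTION_FAILURE;

	/* A missing prepared transaction was probably resolved by other means. */
	return sqlstate == ERRCODE_UNDEFINED_OBJECT;
}
```

With this sketch, a failure reporting "42704" is treated as resolved, whereas "08006" or a NULL SQLSTATE is not, so the caller would retry later. Like the patch, the sketch does not escape quotes in the identifier, so it assumes the prepared-transaction id contains none.
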
002_pg_fdw_resolver_contrib_v6.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..13cd002
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,448 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("pg_fdw_xact_resolver.naptime",
+							"Time to sleep between pg_fdw_xact_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	*dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("pg_fdw_xact_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		int naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time */
+		naptime_msec = (fx_resolver_naptime * 1000L) - (current_time - launched_time) / 1000;
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+
+				elog(WARNING, "status : %d, handle is NULL : %d", status,
+					handle == NULL);
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid dbid;
+
+			/* Get the database list if empty*/
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Work on the first dbid, and remove it from the list */
+			dbid = linitial_oid(dbid_list);
+			dbid_list = list_delete_oid(dbid_list, dbid);
+
+			Assert(OidIsValid(dbid));
+
+			/* Start the foreign transaction resolver */
+			worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+				BGWORKER_BACKEND_DATABASE_CONNECTION;
+			worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+			/* We will start another worker if needed */
+			worker.bgw_restart_time = BGW_NEVER_RESTART;
+			worker.bgw_main = FDWXactResolver_worker_main;
+			snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+			worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+			/* set bgw_notify_pid so that we can wait for it to finish */
+			worker.bgw_notify_pid = MyProcPid;
+
+			RegisterDynamicBackgroundWorker(&worker, &handle);
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+														   fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List *dblist = NIL;
+	ListCell *cell;
+	ListCell *next;
+	ListCell *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple tup;
+	Relation rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid dbid = lfirst_oid(cell);
+		bool exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index c8708ec..e2048ee 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -127,6 +127,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &passwordcheck;
  &pgbuffercache;
  &pgcrypto;
+ &pg-fdw-xact-resolver;
  &pgfreespacemap;
  &pgprewarm;
  &pgrowlocks;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 2624c62..79d076c 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
+<!ENTITY pg-fdw-xact-resolver SYSTEM "pg-fdw-xact-resolver.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
 <!ENTITY pgprewarm       SYSTEM "pgprewarm.sgml">
 <!ENTITY pgrowlocks      SYSTEM "pgrowlocks.sgml">
diff --git a/doc/src/sgml/pg-fdw-xact-resolver.sgml b/doc/src/sgml/pg-fdw-xact-resolver.sgml
new file mode 100644
index 0000000..b47073c
--- /dev/null
+++ b/doc/src/sgml/pg-fdw-xact-resolver.sgml
@@ -0,0 +1,60 @@
+<!-- doc/src/sgml/pg-fdw-xact-resolver.sgml -->
+
+<sect1 id="pg-fdw-xact-resolver" xreflabel="pg_fdw_xact_resolver">
+ <title>pg_fdw_xact_resolver</title>
+
+ <indexterm zone="pg-fdw-xact-resolver">
+  <primary>pg_fdw_xact_resolver</primary>
+ </indexterm>
+
+ <para>
+  The <filename>pg_fdw_xact_resolver</> module launches a foreign transaction
+  resolver process to resolve unresolved transactions prepared on foreign
+  servers.
+ </para>
+
+ <para>
+  A transaction involving multiple foreign servers uses the two-phase commit
+  protocol when it commits. Any crash or connection failure after the
+  transaction is prepared but before it is committed leaves the prepared
+  transaction in an unresolved state. To resolve such a dangling transaction,
+  call <function>pg_fdw_xact_resolve</function>.
+ </para>
+
+ <para>
+  The foreign transaction resolver process launches a separate background
+  worker process to resolve the dangling foreign transactions in each
+  database. The worker simply connects to the database as needed and
+  calls <function>pg_fdw_xact_resolve</function>. The launcher process
+  launches at most one worker at a time.
+ </para>
+ 
+ <sect2>
+  <title>Configuration Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_fdw_xact_resolver.naptime</varname> (<type>integer</type>)
+    </term>
+
+    <listitem>
+     <para>
+      Specifies the minimum delay between foreign transaction resolver runs
+      on any given database. The delay is measured in seconds, and the
+      default is one minute (1min). This parameter can only be set in
+      the <filename>postgresql.conf</filename> file or on the server
+      command line.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Author</title>
+  <para>
+   Ashutosh Bapat <email>ashutosh.bapat@enterprisedb.com</email>, Masahiko Sawada
+   <email>sawada.mshk@gmail.com</email>
+  </para>
+ </sect2>
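A note on units in the launcher loop of the patch above: three time units meet there. The pg_fdw_xact_resolver.naptime setting is in seconds, TimestampTz values count microseconds, and the WaitLatch() timeout is in milliseconds. The standalone sketch below shows one way to compute the remaining sleep time with every conversion spelled out. It is plain C with no PostgreSQL headers, and usec_t and remaining_nap_msec() are hypothetical names used only for illustration, not part of the patch.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's TimestampTz: a count of microseconds. */
typedef int64_t usec_t;

/*
 * Hypothetical helper (not in the patch): how many milliseconds are left
 * to sleep, given a naptime in seconds and two microsecond timestamps.
 * Returns 0 once the naptime has fully elapsed.
 */
int64_t
remaining_nap_msec(int naptime_sec, usec_t launched_at, usec_t now)
{
	int64_t		nap_usec = (int64_t) naptime_sec * 1000000;	/* s  -> us */
	int64_t		elapsed_usec = now - launched_at;			/* us - us  */
	int64_t		left_usec = nap_usec - elapsed_usec;

	if (left_usec <= 0)
		return 0;

	return left_usec / 1000;								/* us -> ms */
}
```

With a 60-second naptime and 30 seconds elapsed, the helper returns 30000, which is the value a millisecond-based wait would expect.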

003_regression_test_for_fdw_xact_v6.patch (text/x-patch; charset=US-ASCII)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..b3413ce 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -19,4 +19,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/009_fdw_xact.pl b/src/test/recovery/t/009_fdw_xact.pl
new file mode 100644
index 0000000..79711bc
--- /dev/null
+++ b/src/test/recovery/t/009_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check if we can commit and rollback transaction involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check if we can commit and rollback transaction involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit transactions involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact exists only in WAL.
+# Check if fdw_xact entry can be processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after the foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");
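The test above stops fs1 between PREPARE TRANSACTION and COMMIT PREPARED, then expects pg_fdw_xact_resolve() to finish the job. The body of that function is not shown in this part of the thread, so the following is only a sketch of the general two-phase commit recovery rule such a resolver is usually built on: if the local transaction committed, the prepared remote transaction is committed; if it aborted, the remote one is rolled back; if it is still in progress, nothing is done yet. LocalOutcome and build_resolve_command() are hypothetical names, and the sketch assumes the transaction identifier contains no quote characters.

```c
#include <stdio.h>

/* Hypothetical: the fate of the local transaction that owns a prepared
 * foreign transaction. */
typedef enum
{
	LOCAL_COMMITTED,
	LOCAL_ABORTED,
	LOCAL_IN_PROGRESS
} LocalOutcome;

/*
 * Hypothetical helper: write the SQL command that would resolve the
 * prepared foreign transaction 'gid' into buf.  Returns the command
 * length, or 0 (with an empty buf) when the local outcome is not yet
 * known and the foreign transaction must be left alone.
 * Assumes gid contains no single quotes; real code would have to escape it.
 */
int
build_resolve_command(LocalOutcome outcome, const char *gid,
					  char *buf, size_t buflen)
{
	switch (outcome)
	{
		case LOCAL_COMMITTED:
			return snprintf(buf, buflen, "COMMIT PREPARED '%s'", gid);
		case LOCAL_ABORTED:
			return snprintf(buf, buflen, "ROLLBACK PREPARED '%s'", gid);
		case LOCAL_IN_PROGRESS:
			break;
	}

	if (buflen > 0)
		buf[0] = '\0';
	return 0;
}
```

The key design point is the third case: a resolver must never guess the outcome of a transaction whose local fate is undecided, which is why it returns without issuing any command.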

#112

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

almost 9 years ago

In reply to: Masahiko Sawada (#111)

Re: Transactions involving multiple postgres foreign servers

Hi Sawada-san,

On 2017/01/26 16:51, Masahiko Sawada wrote:

Thank you for reviewing!

I think this is a bug of pg_fdw_resolver contrib module. I had
forgotten to change the SQL executed by pg_fdw_resolver process.
Attached latest version 002 patch.

As previous version patch conflicts to current HEAD, attached updated
version patches. Also I fixed some bugs in pg_fdw_xact_resolver and
added some documentations.
Please review it.

Thank you for updating the patches.

I have applied the patches on Postgres HEAD.
I have created the postgres_fdw extension in PostgreSQL and then I got a
segmentation fault.
Details:
=# 2017-01-26 17:52:56.156 JST [3411] LOG: worker process: foreign
transaction resolver launcher (PID 3418) was terminated by signal 11:
Segmentation fault
2017-01-26 17:52:56.156 JST [3411] LOG: terminating any other active
server processes
2017-01-26 17:52:56.156 JST [3425] WARNING: terminating connection
because of crash of another server process
2017-01-26 17:52:56.156 JST [3425] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
2017-01-26 17:52:56.156 JST [3425] HINT: In a moment you should be able
to reconnect to the database and repeat your command.

Is this a bug?

Regards,
Vinayak Pokale
NTT Open Source Software Center

#113

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: vinayak (#112)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 26, 2017 at 6:04 PM, vinayak
<Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

Hi Sawada-san,

On 2017/01/26 16:51, Masahiko Sawada wrote:

Thank you for reviewing!

I think this is a bug of pg_fdw_resolver contrib module. I had
forgotten to change the SQL executed by pg_fdw_resolver process.
Attached latest version 002 patch.

As previous version patch conflicts to current HEAD, attached updated
version patches. Also I fixed some bugs in pg_fdw_xact_resolver and
added some documentations.
Please review it.

Thank you for updating the patches.

I have applied the patches on Postgres HEAD.
I have created the postgres_fdw extension in PostgreSQL and then I got a
segmentation fault.
Details:
=# 2017-01-26 17:52:56.156 JST [3411] LOG: worker process: foreign
transaction resolver launcher (PID 3418) was terminated by signal 11:
Segmentation fault
2017-01-26 17:52:56.156 JST [3411] LOG: terminating any other active server
processes
2017-01-26 17:52:56.156 JST [3425] WARNING: terminating connection because
of crash of another server process
2017-01-26 17:52:56.156 JST [3425] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit, because
another server process exited abnormally and possibly corrupted shared
memory.
2017-01-26 17:52:56.156 JST [3425] HINT: In a moment you should be able to
reconnect to the database and repeat your command.

Is this a bug?

Thank you for testing!

Sorry, I attached the wrong version of the pg_fdw_xact_resolver patch. Please
use the attached patch.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
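Comparing the v6 and v7 versions of FDWXactResolverMain() in this thread, v7 wraps the worker launch in an if (dbid_list) check. That suggests (an inference from the diff, not something stated in the mail) that the reported crash came from taking the first element of an empty database list, which happens whenever no database has unresolved foreign transactions. The standalone sketch below illustrates the guard with a minimal linked list. OidNode and pop_first_oid() are hypothetical stand-ins for PostgreSQL's List API, not code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int Oid;		/* stand-in for PostgreSQL's Oid */

/* Hypothetical minimal list of database OIDs. */
typedef struct OidNode
{
	Oid			value;
	struct OidNode *next;
} OidNode;

/*
 * Hypothetical helper: remove the first OID from the list and store it
 * in *out.  Returns false, touching nothing, when the list is empty;
 * the caller then simply skips launching a worker this round.
 */
bool
pop_first_oid(OidNode **head, Oid *out)
{
	if (*head == NULL)
		return false;			/* empty list: nothing to resolve */

	*out = (*head)->value;
	*head = (*head)->next;
	return true;
}
```

Checking for the empty list before dereferencing its head is the same shape as the v7 guard: the launcher goes back to sleep instead of reading through a null pointer.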

Attachments:

002_pg_fdw_resolver_contrib_v7.patch (text/x-patch; charset=US-ASCII)

diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..c57de0a
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,453 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("pg_fdw_xact_resolver.naptime",
+							"Time to sleep between pg_fdw_xact_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	*dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("pg_fdw_xact_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		int naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time */
+		naptime_msec = (fx_resolver_naptime * 1000L) - (current_time - launched_time) / 1000;
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid dbid;
+
+			/* Get the database list if empty */
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has a database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+														   fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List *dblist = NIL;
+	ListCell *cell;
+	ListCell *next;
+	ListCell *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple tup;
+	Relation rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Check whether each database has a foreign transaction entry. Delete
+	 * the database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid dbid = lfirst_oid(cell);
+		bool exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index c8708ec..e2048ee 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -127,6 +127,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &passwordcheck;
  &pgbuffercache;
  &pgcrypto;
+ &pg-fdw-xact-resolver;
  &pgfreespacemap;
  &pgprewarm;
  &pgrowlocks;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 2624c62..79d076c 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
+<!ENTITY pg-fdw-xact-resolver SYSTEM "pg-fdw-xact-resolver.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
 <!ENTITY pgprewarm       SYSTEM "pgprewarm.sgml">
 <!ENTITY pgrowlocks      SYSTEM "pgrowlocks.sgml">
diff --git a/doc/src/sgml/pg-fdw-xact-resolver.sgml b/doc/src/sgml/pg-fdw-xact-resolver.sgml
new file mode 100644
index 0000000..b47073c
--- /dev/null
+++ b/doc/src/sgml/pg-fdw-xact-resolver.sgml
@@ -0,0 +1,60 @@
+<!-- doc/src/sgml/pg-fdw-xact-resolver.sgml -->
+
+<sect1 id="pg-fdw-xact-resolver" xreflabel="pg_fdw_xact_resolver">
+ <title>pg_fdw_xact_resolver</title>
+
+ <indexterm zone="pg-fdw-xact-resolver">
+  <primary>pg_fdw_xact_resolver</primary>
+ </indexterm>
+
+ <para>
+  The <filename>pg_fdw_xact_resolver</> module launches a foreign transaction
+  resolver process to resolve unresolved transactions prepared on foreign
+  servers.
+ </para>
+
+ <para>
+  A transaction involving multiple foreign servers uses the two-phase commit
+  protocol when it commits. Any crash or connection failure after the
+  transaction is prepared but before it is committed leaves the prepared transaction
+  in an unresolved state. To resolve such a dangling transaction, we need to
+  call <function>pg_fdw_xact_resolve</function>.
+ </para>
+
+ <para>
+  The foreign transaction resolver process launches a separate background
+  worker process to resolve the dangling foreign transactions in each
+  database. The worker simply connects to the database as needed and
+  calls <function>pg_fdw_xact_resolve</function>. The launcher process
+  launches at most one worker at a time.
+ </para>
+ 
+ <sect2>
+  <title>Configuration Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_fdw_xact_resolver.naptime</varname> (<type>integer</type>)
+    </term>
+
+    <listitem>
+     <para>
+      Specifies the minimum delay between foreign transaction resolver runs
+      on any given database. The delay is measured in seconds, and the
+      default is one minute (1min). This parameter can only be set in
+      the <filename>postgresql.conf</filename> file or on the server
+      command line.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Author</title>
+  <para>
+   Ashutosh Bapat <email>ashutosh.bapat@enterprisedb.com</email>, Masahiko Sawada
+   <email>sawada.mshk@gmail.com</email>
+  </para>
+ </sect2>
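
A note on the launcher loop in FDWXactResolverMain() above: the sleep computation has to reconcile three units, namely the naptime setting in seconds, the WaitLatch() timeout in milliseconds, and timestamp differences, which are in microseconds when TimestampTz is a 64-bit integer. Below is a minimal standalone sketch of that conversion. The type timestamp_us and the function remaining_nap_ms() are hypothetical stand-ins invented for illustration; they are not part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for TimestampTz: microseconds stored as int64. */
typedef int64_t timestamp_us;

/*
 * Milliseconds left to sleep before the next worker launch is due.
 *
 * now, launched: timestamps in microseconds
 * naptime_s:     configured nap time in seconds
 *
 * Returns 0 when the nap time has already elapsed.
 */
static long
remaining_nap_ms(timestamp_us now, timestamp_us launched, int naptime_s)
{
	int64_t		elapsed_ms;
	int64_t		left_ms;

	assert(naptime_s >= 0);
	assert(now >= launched);

	/* microseconds -> milliseconds before mixing with the nap time */
	elapsed_ms = (now - launched) / 1000;
	left_ms = (int64_t) naptime_s * 1000 - elapsed_ms;

	return left_ms > 0 ? (long) left_ms : 0;
}
```

With a 60 second nap time, a call made 1.5 seconds after the last launch would yield a timeout of 58500 ms, and any call made after the minute has passed yields 0, so the loop would not sleep at all.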

#114

Peter Eisentraut

peter.eisentraut@2ndquadrant.com

almost 9 years ago

In reply to: Masahiko Sawada (#113)

Re: Transactions involving multiple postgres foreign servers

On 1/26/17 4:49 AM, Masahiko Sawada wrote:

Sorry, I attached wrong version patch of pg_fdw_xact_resovler. Please
use attached patch.

So in some other thread we are talking about renaming "xlog", because
nobody knows what the "x" means. In the spirit of that, let's find
better names for new functions as well.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#115

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

almost 9 years ago

In reply to: Peter Eisentraut (#114)

Re: Transactions involving multiple postgres foreign servers

On 2017/01/29 0:11, Peter Eisentraut wrote:

On 1/26/17 4:49 AM, Masahiko Sawada wrote:

Sorry, I attached wrong version patch of pg_fdw_xact_resovler. Please
use attached patch.

So in some other thread we are talking about renaming "xlog", because
nobody knows what the "x" means. In the spirit of that, let's find
better names for new functions as well.

Regards,
Vinayak Pokale
NTT Open Source Software Center


#116

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

almost 9 years ago

In reply to: Peter Eisentraut (#114)

Re: Transactions involving multiple postgres foreign servers

On Sat, Jan 28, 2017 at 8:41 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 1/26/17 4:49 AM, Masahiko Sawada wrote:

Sorry, I attached wrong version patch of pg_fdw_xact_resovler. Please
use attached patch.

So in some other thread we are talking about renaming "xlog", because
nobody knows what the "x" means. In the spirit of that, let's find
better names for new functions as well.

It's common in English (not just the database jargon) to abbreviate
"trans" by "x" [1]. xlog went a bit far by abbreviating whole
"transaction" by "x". But here "xact" means "transact", which is fine.
May be we should use 'X' instead of 'x', I don't know. Said that, I am
fine with any other name which conveys what the function does.

[1]: https://en.wikipedia.org/wiki/X

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#117

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Ashutosh Bapat (#116)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jan 30, 2017 at 12:50 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Sat, Jan 28, 2017 at 8:41 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:

On 1/26/17 4:49 AM, Masahiko Sawada wrote:

Sorry, I attached wrong version patch of pg_fdw_xact_resovler. Please
use attached patch.

So in some other thread we are talking about renaming "xlog", because
nobody knows what the "x" means. In the spirit of that, let's find
better names for new functions as well.

It's common in English (not just the database jargon) to abbreviate
"trans" by "x" [1]. xlog went a bit far by abbreviating whole
"transaction" by "x". But here "xact" means "transact", which is fine.
May be we should use 'X' instead of 'x', I don't know. Said that, I am
fine with any other name which conveys what the function does.

[1] https://en.wikipedia.org/wiki/X

"txn" can be used for abbreviation of "Transaction", so for example
pg_fdw_txn_resolver?
I'm also fine to change the module and function name.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#118

Michael Paquier

michael.paquier@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#113)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jan 26, 2017 at 6:49 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Sorry, I attached wrong version patch of pg_fdw_xact_resovler. Please
use attached patch.

This patch has been moved to CF 2017-03.
--
Michael


#119

Robert Haas

robertmhaas@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#117)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jan 30, 2017 at 2:30 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"txn" can be used for abbreviation of "Transaction", so for example
pg_fdw_txn_resolver?
I'm also fine to change the module and function name.

If we're judging the relative clarity of various ways of abbreviating
the word "transaction", "txn" surely beats "x".

To repeat my usual refrain, is there any merit to abbreviating at all?
Could we call it, say, "fdw_transaction_resolver" or
"fdw_transaction_manager"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#120

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Robert Haas (#119)

Re: Transactions involving multiple postgres foreign servers

On Wed, Feb 1, 2017 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 30, 2017 at 2:30 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"txn" can be used for abbreviation of "Transaction", so for example
pg_fdw_txn_resolver?
I'm also fine to change the module and function name.

If we're judging the relative clarity of various ways of abbreviating
the word "transaction", "txn" surely beats "x".

To repeat my usual refrain, is there any merit to abbreviating at all?
Could we call it, say, "fdw_transaction_resolver" or
"fdw_transaction_manager"?

Almost all modules in contrib are named with a "pg_" prefix, but I prefer
"fdw_transaction_resolver" if we don't need the "pg_" prefix.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#121

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#120)

4 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Mon, Feb 6, 2017 at 10:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Feb 1, 2017 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 30, 2017 at 2:30 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"txn" can be used for abbreviation of "Transaction", so for example
pg_fdw_txn_resolver?
I'm also fine to change the module and function name.

If we're judging the relative clarity of various ways of abbreviating
the word "transaction", "txn" surely beats "x".

To repeat my usual refrain, is there any merit to abbreviating at all?
Could we call it, say, "fdw_transaction_resolver" or
"fdw_transaction_manager"?

Almost all modules in contrib are named with a "pg_" prefix, but I prefer
"fdw_transaction_resolver" if we don't need the "pg_" prefix.

Since the previous patches conflict with current HEAD, I have attached the
latest versions of the patches.
Please review them.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

003_regression_test_for_fdw_xact_v8.patch (application/octet-stream)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..b3413ce 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -19,4 +19,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/009_fdw_xact.pl b/src/test/recovery/t/009_fdw_xact.pl
new file mode 100644
index 0000000..79711bc
--- /dev/null
+++ b/src/test/recovery/t/009_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check if we can commit and rollback transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check if we can commit and rollback transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit transactions involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact exists only in WAL.
+# Check if the fdw_xact entries can be processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after a foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");

000_support_fdw_xact_v8.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dc63d7d..e2947df 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1431,6 +1431,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0c1db07..a5ddbca 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1700,5 +1700,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. Every Foreign Data Wrapper is
+    required to register the foreign server along with the <productname>PostgreSQL</>
+    user whose user mapping is used to connect to the foreign server while starting a
+    transaction on the foreign server as part of the transaction on
+    <productname>PostgreSQL</> using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such transaction is as follows
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is more than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used
+    when two or more foreign servers that support two-phase commit
+    are involved in the transaction, or when the
+    transaction involves one foreign server that supports two-phase commit
+    and also changes local data. In other cases, for example when only one
+    foreign server that supports two-phase commit is involved in the transaction,
+    the two-phase commit protocol is not used. The two-phase commit protocol
+    proceeds in two phases: a prepare phase and a commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, the prepare phase fails. In the commit
+    phase, all the prepared transactions are committed if the prepare phase succeeded,
+    or rolled back if the prepare phase failed to prepare transactions on all the foreign
+    servers.
+    </para>
+
+    <para>
+    During prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier
+    with action FDW_XACT_RESOLVED.
+    </para>
+    
+    <para>
+    During commit phase the distributed transaction manager calls
+    <function>ResolveForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to rollback the prepared transaction. In case the
+    distributed transaction manager fails to commit or rollback a prepared
+    transaction because of connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing extension pg_fdw_xact_resolver
+    and including $libdir/pg_fdw_xact_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit
+    cannot be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while the same
+    on other foreign servers would be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..5c35bd1
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
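The description routine above prints the prepared-transaction identifier with "%.*s" and an explicit length, which suggests the identifier is stored as a counted byte string rather than a NUL-terminated one. Below is a minimal standalone sketch of that formatting, outside the backend. DemoFdwXactRecord and demo_fdw_xact_desc are hypothetical stand-ins for FDWXactOnDiskData and fdw_xact_desc, and snprintf replaces appendStringInfo; none of these demo names are part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for the patch's FDWXactOnDiskData. Only the fields
 * printed by fdw_xact_desc() are modelled; OIDs and xids are plain unsigned.
 */
typedef struct
{
	unsigned int dboid;
	unsigned int local_xid;
	unsigned int serverid;
	unsigned int userid;
	int			 fdw_xact_id_len;	/* number of valid bytes in fdw_xact_id */
	char		 fdw_xact_id[64];	/* not necessarily NUL-terminated */
} DemoFdwXactRecord;

/*
 * Format an "insert" record the same way the patch's description routine
 * does. The identifier is printed with "%.*s" so that only fdw_xact_id_len
 * bytes are emitted. Returns the snprintf() result.
 */
static int
demo_fdw_xact_desc(char *buf, size_t buflen, const DemoFdwXactRecord *rec)
{
	assert(buf != NULL && rec != NULL);

	return snprintf(buf, buflen,
					"Foreign server oid: %u user oid: %u database id: %u"
					" local xid: %u foreign transaction info: %.*s",
					rec->serverid, rec->userid, rec->dboid, rec->local_xid,
					rec->fdw_xact_id_len, rec->fdw_xact_id);
}
```

With a record whose identifier buffer holds extra bytes past fdw_xact_id_len, only the counted prefix appears in the output, which is the property the "%.*s" in the patch relies on.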
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..46307d7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..ed6dcc6
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2200 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign server/s.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on the other foreign servers
+ * are committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and WAL in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		get_prepare_id;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction context memory */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (max_fdw_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier providing function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be at most 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;		/* user mapping id for connection key */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;   /* XLOG offset of inserting this entry start */
+	XLogRecPtr		fdw_xact_end_lsn;   /* XLOG offset of inserting this entry end*/
+
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool            ondisk;             /* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * The name of a foreign prepared transaction file consists of the xid, the
+ * foreign server oid and the user oid, each as 8 hex digits, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+/*
+ * During replay and replication KnownFDWXactList holds info about active foreign server
+ * transactions that haven't been moved to files yet. We will need that info by the end
+ * of recovery (including promote) to restore the memory state of those transactions.
+ *
+ * A naive approach would be to move each PREPARE record to disk, fsync it and not have
+ * that list at all, but that provokes a lot of unnecessary fsyncs on small files,
+ * causing the replica to be slower than the master.
+ *
+ * Replay of twophase records happens by the following rules:
+ *		* On PREPARE redo KnownFDWXactAdd() is called to add that transaction to
+ *		  KnownFDWXactList and no more actions are taken.
+ *		* On checkpoint redo we iterate through KnownFDWXactList, move all prepare
+ *		  records that are behind redo_horizon to files and delete them from the list.
+ *		* On COMMIT/ABORT we delete the file or the entry in KnownFDWXactList.
+ *		* At the end of recovery we move all known foreign server transactions to disk
+ *		  to allow RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *		  to do their work.
+ */
+typedef struct KnownFDWXact
+{
+	TransactionId	local_xid;
+	Oid				serverid;
+	Oid				userid;
+	XLogRecPtr		fdw_xact_start_lsn;
+	XLogRecPtr		fdw_xact_end_lsn;
+	dlist_node		list_node;
+} KnownFDWXact;
+
+static dlist_head KnownFDWXactList = DLIST_STATIC_INIT(KnownFDWXactList);
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries();
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Normally the foreign transactions are prepared on the foreign servers that
+ * can execute the two-phase commit protocol; those will be aborted or committed
+ * after the current transaction has been aborted or committed, respectively.
+ * However, if only one such server is involved in the transaction and no
+ * changes were made to local data, two-phase commit is not needed, so we
+ * simply try to commit the transaction on that server. We also try to commit
+ * the transactions on the rest of the foreign servers now. For these foreign
+ * servers it is possible that some transactions commit even if the local
+ * transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute the two-phase
+	 * commit protocol have already committed, and MyFDWConnections contains
+	 * only foreign servers that can execute it. We don't need to use the
+	 * two-phase commit protocol if there is only one such foreign server and
+	 * the transaction did not write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * We don't need to use the two-phase commit protocol when only one
+		 * server remains, even if that server can execute it.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* MyFDWConnections should be cleared here */
+		MyFDWConnections = list_delete_first(MyFDWConnections);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_id;
+		int				fdw_xact_id_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+												 fdw_conn->userid,
+												 &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * disk and WAL (and thereby relays it to standbys). Thus in case we
+		 * lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared a transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+		/*
+		 * Between register_fdw_xact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction (say,
+		 * because of a signal). During abort processing, it will send an ABORT
+		 * message to the foreign server. If the foreign server has not prepared
+		 * the transaction, the message will succeed. If the foreign server has
+		 * prepared transaction, it will throw an error, which we will ignore and the
+		 * prepared foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the error
+			 * text from FDW and add it here in the message or as a context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; it is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* The WAL record is flushed, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+			XLogRecPtr			recptr;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			START_CRIT_SECTION();
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+			XLogFlush(recptr);
+
+			END_CRIT_SECTION();
+
+			/* Remove the file from the disk if exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reversed
+	 * the order and the process exited between the two steps, we would be left
+	 * with an entry locked by this backend, unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of the two-phase commit protocol.
+ * At the end of the transaction, perform the following actions:
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to
+	 * check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED) ?
+							true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with the given criteria. The criteria are defined by the arguments
+ * that hold valid values for their respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When no criteria are given (all arguments invalid), every entry matches,
+ * so the function returns true whenever at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry already locked by another backend is
+ * ignored. If all the qualifying entries are locked by other backends, the
+ * list will be empty but the returned value will still be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		KnownFDWXactAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec        *fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		KnownFDWXactRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						   fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * For simplicity, the function performs all the I/O while holding
+ * FDWXactLock: it walks the array of entries and writes a state file for
+ * each valid entry that is not on disk yet and whose end LSN is at or before
+ * the redo horizon. Moving the I/O out of the lock would create a race: an
+ * entry, and hence its file, might get removed after the lock is released.
+ * See the comment inside the function for details.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick exit, if there are no entries to process */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXacts that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * would have to check whether somebody committed our transaction in a
+	 * different backend. Let's leave this optimisation for the future, in
+	 * case somebody finds that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on a foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction neither committed nor aborted, and it is not
+				 * running on any backend. It probably crashed before the actual
+				 * abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has an invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		/* fd was already closed after the read above. */
+		ereport(WARNING,
+				  (errmsg("corrupt foreign transaction state file \"%s\"",
+							  path)));
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	/*
+	 * Move foreign transactions from KnownFDWXactList to files, if any.
+	 * It is possible to skip that step and teach subsequent code about
+	 * KnownFDWXactList, but the whole prescan happens only once, at the end of
+	 * recovery or on promotion, so it is probably not worth the complication.
+	 */
+	KnownFDWXactRecreateFiles(InvalidXLogRecPtr);
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* No WAL location is tracked for an entry restored from a file */
+			fdw_xact->fdw_xact_start_lsn = 0;
+			fdw_xact->fdw_xact_end_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
+
+/*
+ * KnownFDWXactAdd
+ *
+ * Record the start/end LSN of a foreign transaction's WAL record, keyed by
+ * xid, in KnownFDWXactList. Called during redo of a prepare record to keep a
+ * list of transactions prepared on foreign servers whose state has not yet
+ * been moved to files by the end of recovery.
+ */
+void
+KnownFDWXactAdd(XLogReaderState *record)
+{
+	KnownFDWXact *fdw_xact;
+	FDWXactOnDiskData *fdw_xact_data_file = (FDWXactOnDiskData *)XLogRecGetData(record);
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = (KnownFDWXact *) palloc(sizeof(KnownFDWXact));
+	fdw_xact->local_xid = fdw_xact_data_file->local_xid;
+	fdw_xact->serverid = fdw_xact_data_file->serverid;
+	fdw_xact->userid = fdw_xact_data_file->userid;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+
+	dlist_push_tail(&KnownFDWXactList, &fdw_xact->list_node);
+}
+
+/*
+ * KnownFDWXactRemove
+ *
+ * Forget about a foreign transaction. Called during commit/abort redo.
+ */
+void
+KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	dlist_mutable_iter miter;
+
+	Assert(RecoveryInProgress());
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact *fdw_xact = dlist_container(KnownFDWXact, list_node,
+												 miter.cur);
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			dlist_delete(miter.cur);
+			/*
+			 * Since we found the entry in KnownFDWXactList, we know that the
+			 * file isn't on disk yet, so we can stop here.
+			 */
+			return;
+		}
+	}
+
+	/*
+	 * Here we know that the file should be removed from disk. But aborting
+	 * recovery because an unneeded file is missing doesn't seem to
+	 * be a good idea, so call remove with giveWarning = false.
+	 */
+	RemoveFDWXactFile(xid, serverid, userid, false);
+}
+
+/*
+ * KnownFDWXactRecreateFiles
+ *
+ * Moves foreign server transaction records from WAL to files. Called during
+ * checkpoint replay or PrescanPreparedTransactions.
+ *
+ * redo_horizon = InvalidXLogRecPtr indicates that all transactions from
+ *		KnownFDWXactList should be moved to disk.
+ */
+void
+KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon)
+{
+	dlist_mutable_iter miter;
+	int			serialized_fdw_xacts = 0;
+
+	Assert(RecoveryInProgress());
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact   *fdw_xact = dlist_container(KnownFDWXact,
+														list_node, miter.cur);
+
+		if (fdw_xact->fdw_xact_end_lsn <= redo_horizon || redo_horizon == InvalidXLogRecPtr)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			pfree(buf);
+			dlist_delete(miter.cur);
+			serialized_fdw_xacts++;
+		}
+	}
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
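
A note on the state file names used by PrescanFDWXacts() and RecoverFDWXactFromFiles() above: both accept a directory entry only when it is exactly FDW_XACT_FILE_NAME_LEN characters long and consists of uppercase hex digits and '_', and then decode it with sscanf(). The standalone sketch below is not code from the patch. It assumes the name is three 8-digit hex fields (xid, serverid, userid) joined by underscores, i.e. 26 characters; NAME_LEN and parse_fdw_xact_name() are hypothetical names. It is also slightly stricter than the strspn() test, because it pins the underscores to positions 8 and 17.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed layout: 8 hex digits, '_', 8 hex digits, '_', 8 hex digits. */
#define NAME_LEN 26

/*
 * Hypothetical helper: validate a state file name and split it into its
 * three fields. Returns false for anything that is not of the form
 * XXXXXXXX_XXXXXXXX_XXXXXXXX with uppercase hex digits.
 */
static bool
parse_fdw_xact_name(const char *name, uint32_t *xid,
					uint32_t *serverid, uint32_t *userid)
{
	unsigned int x, s, u;

	if (strlen(name) != NAME_LEN)
		return false;

	for (size_t i = 0; i < NAME_LEN; i++)
	{
		if (i == 8 || i == 17)
		{
			/* Separator positions must hold an underscore. */
			if (name[i] != '_')
				return false;
		}
		else if (strchr("0123456789ABCDEF", name[i]) == NULL)
			return false;
	}

	/* All three fields must convert. */
	if (sscanf(name, "%8x_%8x_%8x", &x, &s, &u) != 3)
		return false;

	*xid = x;
	*serverid = s;
	*userid = u;
	return true;
}
```

Checking the sscanf() return value, which the patch does not do, is cheap insurance against a name that passes the character-set test but has its underscores in the wrong place.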
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..c10a027 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 50c70b2..aacf744 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -59,6 +59,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1453,6 +1454,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 82f9a3c..1e420b1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -115,6 +116,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -188,6 +191,10 @@ typedef struct TransactionStateData
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;		/* Enter/ExitParallelMode counter */
 	struct TransactionStateData *parent;		/* back link to parent */
+	int			num_foreign_servers;	/* number of foreign servers participating in the transaction,
+										   only valid for top-level transaction */
+	int			can_prepare;			/* can all the foreign servers involved in
+										   this transaction participate in 2PC? */
 } TransactionStateData;
 
 typedef TransactionStateData *TransactionState;
@@ -1919,6 +1926,9 @@ StartTransaction(void)
 	AtStart_Cache();
 	AfterTriggerBeginXact();
 
+	/* Foreign transaction stuff */
+	s->num_foreign_servers = 0;
+
 	/*
 	 * done with start processing, set current transaction state to "in
 	 * progress"
@@ -1979,6 +1989,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2137,6 +2150,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
 	AtCommit_ApplyLauncher();
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2158,6 +2172,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2224,6 +2240,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2429,6 +2448,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2610,9 +2631,12 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4296,6 +4320,32 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	/* Quick exit if there is no need to remember */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	/* Quick exit if there is no need to forget */
+	if (max_fdw_xacts == 0)
+		return;
+
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f23e108..166b997 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -23,6 +23,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5075,6 +5076,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -6142,6 +6144,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6835,7 +6840,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7460,6 +7468,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7646,6 +7655,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8953,6 +8965,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9390,7 +9407,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9411,6 +9429,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9426,6 +9445,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9614,6 +9634,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9656,6 +9677,7 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
+		KnownFDWXactRecreateFiles(checkPoint.redo);
 		RecoveryRestartPoint(&checkPoint);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
@@ -9806,6 +9828,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
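
A note on the PrescanFDWXacts() calls added to xlog.c above: each one folds the XIDs found in the foreign transaction directory into oldestActiveXID, and entries at or beyond nextXid are treated as "future" and removed. The sketch below models that comparison outside the server. It assumes the usual modulo-2^32 ordering for normal XIDs and deliberately ignores the special handling of permanent XIDs that the real TransactionIdPrecedes() family has; xid_precedes() and prescan_oldest() are hypothetical names, not functions from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if a is older than b under modulo-2^32 ordering (normal XIDs only). */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical model of the prescan: skip XIDs at or beyond next_xid (the
 * "future" entries, whose files the patch removes) and return the oldest of
 * the remaining XIDs and oldest_active. *nfuture counts skipped entries.
 */
static uint32_t
prescan_oldest(const uint32_t *xids, size_t n, uint32_t next_xid,
			   uint32_t oldest_active, size_t *nfuture)
{
	*nfuture = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (!xid_precedes(xids[i], next_xid))
		{
			(*nfuture)++;
			continue;
		}
		if (xid_precedes(xids[i], oldest_active))
			oldest_active = xids[i];
	}
	return oldest_active;
}
```

With XIDs {100, 500, 90}, next_xid 200 and an incoming oldest of 150, the sketch skips 500 as a future entry and returns 90. The signed subtraction is what keeps the ordering meaningful across XID wraparound.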
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6511c60..15cad78 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 38be9cf..2a20a10 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index d5d40e6..2981925 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1080,6 +1081,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1375,6 +1390,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
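
A note on the two fdw_xact_exists() checks added to foreigncmds.c above: both pass InvalidTransactionId or InvalidOid in some argument positions, and pg_fdw_xact_remove() maps NULL arguments to the same invalid values before calling search_fdw_xact(). One plausible reading, sketched here as an assumption rather than as the patch's actual search code, is that an invalid value acts as a wildcard and every other value must match exactly. Entry, field_matches() and count_matching() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stands in for InvalidTransactionId / InvalidOid, which are both 0. */
#define INVALID_ID 0u

/* Hypothetical, simplified entry; the real FDWXact carries more fields. */
typedef struct
{
	uint32_t	xid;
	uint32_t	dbid;
	uint32_t	serverid;
	uint32_t	userid;
} Entry;

/* A criterion matches when it is the wildcard or equals the field. */
static bool
field_matches(uint32_t criterion, uint32_t value)
{
	return criterion == INVALID_ID || criterion == value;
}

/* Count the entries that satisfy all four criteria. */
static size_t
count_matching(const Entry *entries, size_t n, uint32_t xid,
			   uint32_t dbid, uint32_t serverid, uint32_t userid)
{
	size_t		hits = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (field_matches(xid, entries[i].xid) &&
			field_matches(dbid, entries[i].dbid) &&
			field_matches(serverid, entries[i].serverid) &&
			field_matches(userid, entries[i].userid))
			hits++;
	}
	return hits;
}
```

Under this reading, the DROP SERVER check corresponds to "any xid, this database, this server, any user", and the DROP USER MAPPING check narrows the last criterion to one user.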
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..0122d63 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -436,6 +436,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -697,6 +700,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -994,6 +1000,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index c95ca5b..57cba91 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepLauncherLock				44
 LogicalRepWorkerLock				45
+FDWXactLock					46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0249721..cc01bef 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2059,6 +2060,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 661b0fa..da979c5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
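
A note on the two fdwxact checkpoint probes added above: in KnownFDWXactRecreateFiles() they bracket a loop that writes an entry to a file when its WAL record ends at or before the redo horizon, or unconditionally when the horizon is InvalidXLogRecPtr. The sketch below isolates just that selection rule. LSNs are modelled as plain 64-bit integers with 0 standing in for InvalidXLogRecPtr; should_flush() and count_flushed() are hypothetical names, not functions from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stands in for InvalidXLogRecPtr. */
#define INVALID_LSN ((uint64_t) 0)

/*
 * Hypothetical predicate mirroring the test in KnownFDWXactRecreateFiles():
 * flush when the record ends at or before the horizon, or always when no
 * horizon is given.
 */
static bool
should_flush(uint64_t end_lsn, uint64_t redo_horizon)
{
	return redo_horizon == INVALID_LSN || end_lsn <= redo_horizon;
}

/* Count how many of the n end LSNs would be flushed for this horizon. */
static size_t
count_flushed(const uint64_t *end_lsns, size_t n, uint64_t redo_horizon)
{
	size_t		flushed = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (should_flush(end_lsns[i], redo_horizon))
			flushed++;
	}
	return flushed;
}
```

Entries that end after the horizon stay in the in-memory list, since their WAL records will still be replayed if recovery restarts from that checkpoint.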
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 540427a..a4fabfe 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -205,6 +205,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 20077a6..3370966 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -301,5 +301,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:   %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 96b7097..0dcdf2f 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -586,6 +586,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -802,6 +803,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..c211f93 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -8,9 +8,10 @@
 #define FRONTEND 1
 #include "postgres.h"
 
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/generic_xlog.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..d326ac1
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void KnownFDWXactAdd(XLogReaderState *record);
+extern void KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index b892aea..93edbb5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..ddb6b5f 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
@@ -356,6 +359,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 578bff5..71526ce 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 23731e9..3920cce 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 41c12af..d7dc950 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5271,6 +5271,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4111 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..565aa1b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -219,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5f38fa6..e5f9d73 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -256,11 +256,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 5bdca82..750054c 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -120,4 +120,8 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c661f1d..b0d27ec 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index d4d00d9..a1086d4 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2273,7 +2275,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

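For readers following the FDW API additions above (PrepareForeignTransaction, EndForeignTransaction, ResolvePreparedForeignTransaction), here is a minimal, self-contained sketch of the general two-phase pattern those callbacks are meant to support. It is not code from the patch: the Participant type and the participant_prepare, participant_resolve and two_phase_commit names are hypothetical stand-ins, and the "foreign servers" are simulated in memory with no PostgreSQL headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct Participant
{
	bool		can_prepare;	/* simulated outcome of the prepare step */
	bool		prepared;		/* true while a prepared transaction is pending */
	bool		committed;		/* set once the work was committed */
	bool		rolled_back;	/* set once the work was rolled back */
} Participant;

/* Phase 1 for one participant: returns false when the prepare fails. */
bool
participant_prepare(Participant *p)
{
	if (!p->can_prepare)
		return false;
	p->prepared = true;
	return true;
}

/* Phase 2 for one participant: commit or roll back, prepared or not. */
void
participant_resolve(Participant *p, bool is_commit)
{
	if (is_commit)
		p->committed = true;
	else
		p->rolled_back = true;
	p->prepared = false;
}

/*
 * Prepare on every participant; commit everywhere only if every prepare
 * succeeded, otherwise roll back every participant.  Returns true when the
 * distributed transaction committed.
 */
bool
two_phase_commit(Participant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	assert(parts != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		if (!participant_prepare(&parts[i]))
		{
			all_prepared = false;
			break;				/* no point preparing the rest */
		}
	}

	for (i = 0; i < n; i++)
		participant_resolve(&parts[i], all_prepared);

	return all_prepared;
}
```

With two participants that can both prepare, two_phase_commit() returns true and both end up committed; if either one fails to prepare, it returns false and both are rolled back. The sketch deliberately leaves out the case where a participant becomes unreachable after preparing, which is what the on-disk FDWXactOnDiskData entries and the resolver are for.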
Attachment: 001_pgfdw_support_atomic_commit_v8.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..cc2b2c6 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * If connection_error_ok is true, the caller handles a connection failure
+ * itself and NULL is returned; if false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -235,11 +262,14 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 			msglen = strlen(connmessage);
 			if (msglen > 0 && connmessage[msglen - 1] == '\n')
 				connmessage[msglen - 1] = '\0';
-			ereport(ERROR,
-			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-				errmsg("could not connect to server \"%s\"",
-					   server->servername),
-				errdetail_internal("%s", connmessage)));
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"", server->servername),
+						 errdetail_internal("%s", connmessage)));
 		}
 
 		/*
@@ -370,15 +400,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and record whether it can take
+		 * part in two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -586,158 +623,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * Crafts a prepared transaction identifier. The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It is a string literal, less than 200 bytes long.
+ * 2. It must not be the same as any other currently prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%u_%u",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The length does not include the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * Prepares the current transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * Commits or aborts a prepared transaction on the foreign server.
+ * This function may be called when there is no cached connection to the
+ * foreign server involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed; raise a warning to log the reason for failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -836,3 +999,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..8c52a11 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7053,3 +7081,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index e24db56..c048c0d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..9f203ad 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -465,6 +467,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1319,7 +1327,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1696,7 +1704,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2291,7 +2299,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2553,7 +2561,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3490,7 +3498,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3580,7 +3588,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3803,7 +3811,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..ff57e98 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..d52e0a9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1683,3 +1713,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index b31f373..21abe78 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -426,6 +426,43 @@
     foreign tables, see <xref linkend="sql-createforeigntable">.
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted independently.
+    Some of these transactions may fail to commit on their remote server while
+    others commit successfully. This behavior may be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> is allowed
+       to use two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </sect3>
  </sect2>
 
  <sect2>

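For readers skimming the patch above: the per-server two_phase_commit option is looked up by scanning the server's option list and falling back to false when the option is absent. The following is only a standalone sketch of that lookup, not the patch's code. ServerOption, parse_bool_option() and uses_two_phase_commit() are hypothetical stand-ins for DefElem, defGetBoolean() and server_uses_two_phase_commit(), and the boolean parser is deliberately simplified.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem-style server option (name/value pair). */
typedef struct ServerOption
{
	const char *name;
	const char *value;
} ServerOption;

/* Simplified boolean parser; the real defGetBoolean() accepts more spellings. */
static bool
parse_bool_option(const char *value)
{
	return strcmp(value, "on") == 0 || strcmp(value, "true") == 0;
}

/* Return the two_phase_commit setting, or false when the option is absent. */
static bool
uses_two_phase_commit(const ServerOption *opts, size_t nopts)
{
	for (size_t i = 0; i < nopts; i++)
	{
		if (strcmp(opts[i].name, "two_phase_commit") == 0)
			return parse_bool_option(opts[i].value);
	}

	/* By default a server is treated as not 2PC capable. */
	return false;
}
```

With the regression-test setup above, a server like loopback (two_phase_commit 'off') and a server with no such option both yield false, while loopback2 and loopback3 yield true.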
002_pg_fdw_resolver_contrib_v8.patch (application/octet-stream)

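A note on the launcher loop in the patch that follows: it sleeps on its latch for whatever remains of pg_fdw_xact_resolver.naptime. PostgreSQL's TimestampTz values count microseconds, while the WaitLatch() timeout is given in milliseconds, so the remaining-time arithmetic needs a unit conversion. Below is a minimal standalone sketch of that computation. The type timestamp_usec and the function remaining_nap_msec() are hypothetical names used only for illustration and do not appear in the patch.

```c
#include <stdint.h>

/* Assumption: timestamps are microsecond counts, as PostgreSQL's TimestampTz is. */
typedef int64_t timestamp_usec;

/*
 * Milliseconds left until naptime_sec seconds have passed since 'launched'.
 * Never returns a negative value.
 */
static long
remaining_nap_msec(timestamp_usec launched, timestamp_usec now, int naptime_sec)
{
	/* Convert the elapsed interval from microseconds to milliseconds. */
	int64_t		elapsed_msec = (now - launched) / 1000;
	int64_t		remaining = (int64_t) naptime_sec * 1000 - elapsed_msec;

	return remaining > 0 ? (long) remaining : 0;
}
```

For example, with a 60 second naptime and 30 seconds elapsed, the sketch yields 30000 ms; once the naptime has fully elapsed it yields 0, so the caller would not sleep.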
diff --git a/contrib/pg_fdw_xact_resolver/Makefile b/contrib/pg_fdw_xact_resolver/Makefile
new file mode 100644
index 0000000..f8924f0
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/Makefile
@@ -0,0 +1,15 @@
+# contrib/pg_fdw_xact_resolver/Makefile
+
+MODULES = pg_fdw_xact_resolver
+PGFILEDESC = "pg_fdw_xact_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_fdw_xact_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
new file mode 100644
index 0000000..c57de0a
--- /dev/null
+++ b/contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
@@ -0,0 +1,453 @@
+/* -------------------------------------------------------------------------
+ *
+ * pg_fdw_xact_resolver.c
+ *
+ * Contrib module to launch a foreign transaction resolver that resolves
+ * unresolved transactions prepared on foreign servers.
+ *
+ * The extension launches a foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches a separate background worker
+ * process to resolve the foreign transactions in each database. The worker
+ * process simply connects to the specified database and calls the
+ * pg_fdw_xact_resolve() function, which tries to resolve the transactions.
+ * The launcher process launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/pg_fdw_xact_resolver/pg_fdw_xact_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("pg_fdw_xact_resolver.naptime",
+							"Time to sleep between pg_fdw_xact_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	*dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("pg_fdw_xact_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		int naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time in msec; TimestampTz differences are in usec */
+		naptime_msec = (fx_resolver_naptime * 1000L) - (current_time - launched_time) / 1000;
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid dbid;
+
+			/* Get the database list if empty */
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has a database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+														   fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need a handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List *dblist = NIL;
+	ListCell *cell;
+	ListCell *next;
+	ListCell *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple tup;
+	Relation rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Check whether each database has a foreign transaction entry. Delete the
+	 * database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid dbid = lfirst_oid(cell);
+		bool exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 03e5889..bfeacdc 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -127,6 +127,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &passwordcheck;
  &pgbuffercache;
  &pgcrypto;
+ &pg-fdw-xact-resolver;
  &pgfreespacemap;
  &pgprewarm;
  &pgrowlocks;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index e7aa92f..ed9c513 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -133,6 +133,7 @@
 <!ENTITY passwordcheck   SYSTEM "passwordcheck.sgml">
 <!ENTITY pgbuffercache   SYSTEM "pgbuffercache.sgml">
 <!ENTITY pgcrypto        SYSTEM "pgcrypto.sgml">
+<!ENTITY pg-fdw-xact-resolver SYSTEM "pg-fdw-xact-resolver.sgml">
 <!ENTITY pgfreespacemap  SYSTEM "pgfreespacemap.sgml">
 <!ENTITY pgprewarm       SYSTEM "pgprewarm.sgml">
 <!ENTITY pgrowlocks      SYSTEM "pgrowlocks.sgml">
diff --git a/doc/src/sgml/pg-fdw-xact-resolver.sgml b/doc/src/sgml/pg-fdw-xact-resolver.sgml
new file mode 100644
index 0000000..b47073c
--- /dev/null
+++ b/doc/src/sgml/pg-fdw-xact-resolver.sgml
@@ -0,0 +1,60 @@
+<!-- doc/src/sgml/pg-fdw-xact-resolver.sgml -->
+
+<sect1 id="pg-fdw-xact-resolver" xreflabel="pg_fdw_xact_resolver">
+ <title>pg_fdw_xact_resolver</title>
+
+ <indexterm zone="pg-fdw-xact-resolver">
+  <primary>pg_fdw_xact_resolver</primary>
+ </indexterm>
+
+ <para>
+  The <filename>pg_fdw_xact_resolver</> module launches a foreign transaction
+  resolver process that resolves unresolved transactions prepared on foreign
+  servers.
+ </para>
+
+ <para>
+  A transaction involving multiple foreign servers uses the two-phase commit
+  protocol when it commits. Any crash or connection failure after the
+  transaction is prepared but before it is committed leaves the prepared
+  transaction unresolved. To resolve such a dangling transaction, we need to
+  call <function>pg_fdw_xact_resolve</function>.
+ </para>
+
+ <para>
+  The foreign transaction resolver process launches a separate background
+  worker process to resolve the dangling foreign transactions in each
+  database. The worker simply connects to the database as needed and
+  calls <function>pg_fdw_xact_resolve</function>. The launcher process
+  launches at most one worker at a time.
+ </para>
+ 
+ <sect2>
+  <title>Configuration Parameters</title>
+
+  <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_fdw_xact_resolver.naptime</varname> (<type>integer</type>)
+    </term>
+
+    <listitem>
+     <para>
+      Specifies the minimum delay between foreign transaction resolver runs
+      on any given database. The delay is measured in seconds, and the
+      default is one minute (1min). This parameter can only be set in
+      the <filename>postgresql.conf</filename> file or on the server
+      command line.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect2>
+
+ <sect2>
+  <title>Author</title>
+  <para>
+   Ashutosh Bapat <email>ashutosh.bapat@enterprisedb.com</email>, Masahiko Sawada
+   <email>sawada.mshk@gmail.com</email>
+  </para>
+ </sect2>
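
Reading aid for the launcher loop in the patch above: the following is a minimal standalone sketch, in plain C without PostgreSQL headers, of the scheduling policy the loop appears to implement. At most one worker runs at a time, databases are taken from the head of a list, and the next launch time is pushed back after every attempt. The names Launcher, launcher_tick and launcher_worker_stopped, the fixed-size array, and the integer millisecond clock are assumptions made for illustration; they are not part of the patch, and refilling the database list is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DBS 16

/* Hypothetical stand-in for the launcher's state; not a PostgreSQL type. */
typedef struct Launcher
{
	unsigned int dbids[MAX_DBS];	/* databases still waiting for a resolver run */
	size_t		ndbs;			/* number of valid entries in dbids */
	bool		worker_running; /* models "handle != NULL" in the patch */
	long		next_launch_ms; /* earliest time the next worker may start */
	long		naptime_ms;		/* minimum delay between launches */
} Launcher;

/*
 * One pass of the launcher loop at time now_ms.  Returns the database OID a
 * worker was started for, or 0 if nothing was launched.
 */
unsigned int
launcher_tick(Launcher *l, long now_ms)
{
	unsigned int dbid;
	size_t		i;

	assert(l != NULL && l->ndbs <= MAX_DBS);

	/* Launch at most one worker at a time, and respect the nap time. */
	if (l->worker_running || now_ms < l->next_launch_ms)
		return 0;

	/* The next launch is postponed even if there turns out to be no work. */
	l->next_launch_ms = now_ms + l->naptime_ms;

	if (l->ndbs == 0)
		return 0;

	/* Work on the first database and remove it from the list. */
	dbid = l->dbids[0];
	for (i = 1; i < l->ndbs; i++)
		l->dbids[i - 1] = l->dbids[i];
	l->ndbs--;

	l->worker_running = true;
	return dbid;
}

/* Models the SIGUSR1 path: the worker has stopped, so forget its handle. */
void
launcher_worker_stopped(Launcher *l)
{
	l->worker_running = false;
}
```

A caller would invoke launcher_tick() once per wakeup and launcher_worker_stopped() when notified that the worker has exited, which corresponds to the got_sigusr1 branch in the patch.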

#122

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#121)

Re: Transactions involving multiple postgres foreign servers

On Wed, Feb 15, 2017 at 3:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Feb 6, 2017 at 10:48 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Feb 1, 2017 at 8:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 30, 2017 at 2:30 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

"txn" can be used for abbreviation of "Transaction", so for example
pg_fdw_txn_resolver?
I'm also fine to change the module and function name.

If we're judging the relative clarity of various ways of abbreviating
the word "transaction", "txn" surely beats "x".

To repeat my usual refrain, is there any merit to abbreviating at all?
Could we call it, say, "fdw_transaction_resolver" or
"fdw_transaction_manager"?

Almost all modules in contrib are named with a "pg_" prefix, but I prefer
"fdw_transaction_resolver" if we don't need the "pg_" prefix.

Since the previous patches conflict with current HEAD, I've attached the
latest versions.
Please review them.

I've created a wiki page[1] describing the design and functionality of
this feature. It also has some example use cases, so the page should be
helpful for testing as well. Please refer to it if you're interested in
testing this feature.

[1] 2PC on FDW
<https://wiki.postgresql.org/wiki/2PC_on_FDW>

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#123

vinayak

Pokale_Vinayak_q3@lab.ntt.co.jp

almost 9 years ago

In reply to: Masahiko Sawada (#122)

Re: Transactions involving multiple postgres foreign servers

On 2017/02/28 16:54, Masahiko Sawada wrote:

I've created a wiki page[1] describing the design and functionality of
this feature. It also has some example use cases, so the page should be
helpful for testing as well. Please refer to it if you're interested in
testing this feature.

[1] 2PC on FDW
<https://wiki.postgresql.org/wiki/2PC_on_FDW>

Thank you for creating the wiki page.

In the "src/test/regress/pg_regress.c" file:
-                * xacts.  (Note: to reduce the probability of unexpected shmmax
-                * failures, don't set max_prepared_transactions any higher than
-                * actually needed by the prepared_xacts regression test.)
+                * xacts. We also set max_fdw_transctions to enable testing of atomic
+                * foreign transactions. (Note: to reduce the probability of unexpected
+                * shmmax failures, don't set max_prepared_transactions or
+                * max_prepared_foreign_transactions any higher than actually needed by the
+                * corresponding regression tests.).

I think we are not setting "max_fdw_transctions" anywhere.
Is this correct?

In the "src/bin/pg_waldump/rmgrdesc.c" file, the following header file is
included twice:
+ #include "access/fdw_xact.h"
I think we need to remove one line.

Regards,
Vinayak Pokale

#124

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: vinayak (#123)

Re: Transactions involving multiple postgres foreign servers

On Thu, Mar 2, 2017 at 11:56 AM, vinayak
<Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

On 2017/02/28 16:54, Masahiko Sawada wrote:

I've created a wiki page[1] describing the design and functionality of
this feature. It also has some example use cases, so the page should be
helpful for testing as well. Please refer to it if you're interested in
testing this feature.

[1] 2PC on FDW
<https://wiki.postgresql.org/wiki/2PC_on_FDW>

Thank you for creating the wiki page.

Thank you for looking at this patch.

In the "src/test/regress/pg_regress.c" file:
-                * xacts.  (Note: to reduce the probability of unexpected shmmax
-                * failures, don't set max_prepared_transactions any higher than
-                * actually needed by the prepared_xacts regression test.)
+                * xacts. We also set max_fdw_transctions to enable testing of atomic
+                * foreign transactions. (Note: to reduce the probability of unexpected
+                * shmmax failures, don't set max_prepared_transactions or
+                * max_prepared_foreign_transactions any higher than actually needed by the
+                * corresponding regression tests.).

I think we are not setting "max_fdw_transctions" anywhere.
Is this correct?

This comment is out of date. Will fix.

In the "src/bin/pg_waldump/rmgrdesc.c" file, the following header file is
included twice:
+ #include "access/fdw_xact.h"
I think we need to remove one line.

Not necessary. Will get rid of it.

Since these are not bugs in the feature itself, I will incorporate the
fixes when I post the updated patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#125

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#124)

5 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Fri, Mar 3, 2017 at 1:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 2, 2017 at 11:56 AM, vinayak
<Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

On 2017/02/28 16:54, Masahiko Sawada wrote:

I've created a wiki page[1] describing the design and functionality of
this feature. It also has some example use cases, so the page should be
helpful for testing as well. Please refer to it if you're interested in
testing this feature.

[1] 2PC on FDW
<https://wiki.postgresql.org/wiki/2PC_on_FDW>

Thank you for creating the wiki page.

Thank you for looking at this patch.

In the "src/test/regress/pg_regress.c" file:
-                * xacts.  (Note: to reduce the probability of unexpected shmmax
-                * failures, don't set max_prepared_transactions any higher than
-                * actually needed by the prepared_xacts regression test.)
+                * xacts. We also set max_fdw_transctions to enable testing of atomic
+                * foreign transactions. (Note: to reduce the probability of unexpected
+                * shmmax failures, don't set max_prepared_transactions or
+                * max_prepared_foreign_transactions any higher than actually needed by the
+                * corresponding regression tests.).

I think we are not setting "max_fdw_transctions" anywhere.
Is this correct?

This comment is out of date. Will fix.

In the "src/bin/pg_waldump/rmgrdesc.c" file, the following header file is
included twice:
+ #include "access/fdw_xact.h"
I think we need to remove one line.

Not necessary. Will get rid of it.

Since these are not bugs in the feature itself, I will incorporate the
fixes when I post the updated patches.

Attached is an updated set of patches.
The differences from the previous version are:
* Fixed a few bugs.
* Separated the previous 000 patch into two patches.
* Renamed the pg_fdw_xact_resolver contrib module to
fdw_transaction_resolver.
* Incorporated the review comments from Vinayak.

Please review these patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_register_local_write_v9.patch (application/octet-stream)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 82f9a3c..0f057e4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -115,6 +115,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -2158,6 +2160,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2429,6 +2433,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2613,6 +2619,8 @@ AbortTransaction(void)
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4296,6 +4304,24 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..0122d63 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -436,6 +436,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -697,6 +700,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -994,6 +1000,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..586f340 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -356,6 +356,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
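
Reading aid for the patch above: it adds a per-backend flag that is set whenever the executor inserts, updates, or deletes a local row, and cleared at the end of commit, prepare, and abort processing. The documentation in the 001 patch describes how that flag feeds the choice of commit protocol. The following is a minimal standalone sketch of that rule in plain C. The helper needs_two_phase_commit and the lower-case function names are hypothetical and not part of the patch. Counting the local node as one more participant, so that two or more participants trigger two-phase commit, is an interpretation of the documentation rather than something the patch states in these words. The sketch also assumes max_prepared_foreign_transactions is greater than zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors XactWriteLocalNode from the patch: true once we wrote locally. */
static bool xact_write_local_node = false;

/* Called from the executor paths for INSERT, UPDATE and DELETE. */
void
register_transaction_local_node(void)
{
	xact_write_local_node = true;
}

/* Called at the end of commit, prepare and abort processing. */
void
unregister_transaction_local_node(void)
{
	xact_write_local_node = false;
}

/*
 * Hypothetical helper, not part of the patch: decide whether the commit
 * should go through two-phase commit.  n_two_phase_servers is the number of
 * registered foreign servers that support two-phase commit.
 */
bool
needs_two_phase_commit(int n_two_phase_servers)
{
	int			participants;

	assert(n_two_phase_servers >= 0);

	/* Without any foreign server there is nothing to coordinate. */
	if (n_two_phase_servers == 0)
		return false;

	/* The local node counts as a participant only if it was written to. */
	participants = n_two_phase_servers + (xact_write_local_node ? 1 : 0);

	return participants >= 2;
}
```

With this rule, a transaction that touches a single foreign server and no local data commits without a prepare step, while the same transaction with a local write is treated like one involving two servers.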

001_support_fdw_xact_v9.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index cd82c04..d0ca05d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1431,6 +1431,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index dbeaab5..639e38b 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1714,5 +1714,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts
+    a transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it must register the foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to it,
+    by calling <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    the two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may use
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is greater than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used when two or more
+    foreign servers that support two-phase commit are involved in the
+    transaction, or when the transaction involves one foreign server that
+    supports two-phase commit and also changes local data. In other cases, for
+    example when only one foreign server that supports two-phase commit is
+    involved and no local data is changed, the two-phase commit protocol is not
+    used. The two-phase commit protocol proceeds in two phases: the prepare
+    phase and the commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed on any of the
+    foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase, the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier
+    with action FDW_XACT_RESOLVED.
+    </para>
+    
+    <para>
+    During the commit phase, the distributed transaction manager calls
+    <function>ResolveForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension fdw_transaction_resolver
+    and including $libdir/fdw_transaction_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit
+    cannot be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while the same
+    on other foreign servers would be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..7cc491e
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..46307d7 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_fdw_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..ed6dcc6
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2200 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If first phase succeeds, foreign servers are requested to commit respective
+ * prepared transactions. If the first phase  does not succeed because of any
+ * failure, the foreign servers are asked to rollback respective prepared
+ * transactions or abort the transactions if they are not prepared.
+ *
+ * Any network failure, server crash after preparing foreign transaction leaves
+ * that prepared transaction unresolved. During the first phase, before actually
+ * preparing the transactions, enough information is persisted to the disk and
+ * logs in order to resolve such transactions.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		get_prepare_id;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction memory context */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+			 fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (max_fdw_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier providing function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available, so save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;		/* user mapping id for connection key */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;	/* XLOG offset where this entry's record starts */
+	XLogRecPtr		fdw_xact_end_lsn;	/* XLOG offset where this entry's record ends */
+
+	bool			fdw_xact_valid;		/* Has the entry been completed and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool			ondisk;				/* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * The name of a foreign prepared transaction file is the xid, the foreign
+ * server oid and the user oid, each as 8 hex digits, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_fdw_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+/*
+ * During replay and replication KnownFDWXactList holds info about active foreign server
+ * transactions that haven't been moved to files yet. We need that info at the end of
+ * recovery (including promotion) to restore the in-memory state of those transactions.
+ *
+ * The naive approach would be to write each PREPARE record to disk, fsync it, and
+ * not keep this list at all, but that causes a lot of unnecessary fsyncs on small
+ * files, making the replica slower than the master.
+ *
+ * Replay of foreign transaction records follows these rules:
+ *		* On PREPARE redo, KnownFDWXactAdd() is called to add that transaction to
+ *		  KnownFDWXactList, and no further action is taken.
+ *		* On checkpoint redo, we iterate through KnownFDWXactList, move all prepare
+ *		  records that are behind redo_horizon to files, and delete them from the list.
+ *		* On COMMIT/ABORT, we delete the file or the entry in KnownFDWXactList.
+ *		* At the end of recovery, we move all known foreign server transactions to disk
+ *		  to allow RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *		  to do their work.
+ */
+typedef struct KnownFDWXact
+{
+	TransactionId	local_xid;
+	Oid				serverid;
+	Oid				userid;
+	XLogRecPtr		fdw_xact_start_lsn;
+	XLogRecPtr		fdw_xact_end_lsn;
+	dlist_node		list_node;
+} KnownFDWXact;
+
+static dlist_head KnownFDWXactList = DLIST_STATIC_INIT(KnownFDWXactList);
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_id_len, char *fdw_xact_id);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * This is a GUC variable; changing it requires a restart.
+ */
+int	max_fdw_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_fdw_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_fdw_xacts));
+		for (cnt = 0; cnt < max_fdw_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * This function is responsible for pre-commit processing on foreign connections.
+ * Normally, foreign transactions are prepared on the foreign servers that can
+ * execute the two-phase commit protocol; those prepared transactions are
+ * committed or aborted after the local transaction has committed or aborted,
+ * respectively. However, if only one such server is involved in the transaction
+ * and no changes were made to local data, two-phase commit is unnecessary, so we
+ * simply try to commit the transaction on that server. Transactions on the
+ * remaining foreign servers, which cannot execute two-phase commit, are
+ * committed now. For these servers it is possible that some transactions commit
+ * even if the local transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no more part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute the two-phase
+	 * commit protocol have already committed, and MyFDWConnections contains
+	 * only servers that can execute it. We don't need the two-phase commit
+	 * protocol if there is only one such foreign server and the transaction
+	 * did not write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * With only one server remaining, we don't need the two-phase commit
+		 * protocol, even if this server can execute it.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Forget the only remaining connection, leaving MyFDWConnections empty */
+		MyFDWConnections = list_delete_first(MyFDWConnections);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_id;
+		int				fdw_xact_id_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+												 fdw_conn->userid,
+												 &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * disk and WAL (thereby relaying it to the standby). Thus, if we lose
+		 * connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have a prepared transaction on the foreign server
+		 * and try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+		/*
+		 * Between the register_fdw_xact call above and the moment this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it will
+		 * send an ABORT message to the foreign server. If the foreign server has
+		 * not prepared the transaction, the message will succeed. If it has
+		 * prepared the transaction, it will throw an error, which we ignore, and
+		 * the prepared foreign transaction will be resolved by the foreign
+		 * transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from the foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the error
+			 * text from FDW and add it here in the message or as a context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ *
+ * This function creates a new foreign transaction entry before an FDW executes
+ * the first phase of two-phase commit. The entry is added to WAL and is persisted
+ * to disk under the pg_fdw_xact directory at the next checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* The WAL record is flushed, so a checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_fdw_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_fdw_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+			XLogRecPtr			recptr;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			START_CRIT_SECTION();
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+			XLogFlush(recptr);
+
+			END_CRIT_SECTION();
+
+			/* Remove the file from the disk if it exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of locked foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and the process exits in between, we will be left with an entry
+	 * locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve = NIL;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Fetch and lock all the entries belonging to the given transaction id. If
+	 * the foreign transaction resolver is running, it might have locked entries
+	 * to check whether they can be resolved. The search function skips such
+	 * entries; the resolver will resolve them at a later point in time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is a wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria are defined by the arguments that
+ * have valid values for their respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When no criteria are given (all arguments invalid), every entry matches, so
+ * the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked, nothing will
+ * be returned in the list but the returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		KnownFDWXactAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec        *fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		KnownFDWXactRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						   fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function writes out the state files of foreign transactions whose WAL
+ * records precede the checkpoint's redo horizon and which are not yet on disk.
+ * The foreign transaction entries and hence the corresponding files are
+ * expected to be very short-lived. By executing this function at the end, we
+ * might have fewer files to write and fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * For simplicity, the function performs the I/O while holding FDWXactLock; see
+ * the comment inside the function for the trade-off.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_fdw_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick check, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, in case somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on a foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is
+				 * not running on any of the backends. So it probably crashed
+				 * before the actual abort or commit. Assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("FDW transaction state file \"%s\" has invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* fd has already been closed above */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	/*
+	 * Move foreign transactions from KnownFDWXactList to files, if any.
+	 * It is possible to skip that step and teach subsequent code about
+	 * KnownFDWXactList, but the whole PreScan() happens once during end of
+	 * recovery or promotion, so it probably isn't worth the complication.
+	 */
+	KnownFDWXactRecreateFiles(InvalidXLogRecPtr);
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* No WAL location is recorded for an entry read from a file */
+			fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+			fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
+
+/*
+ * KnownFDWXactAdd
+ *
+ * Store the correspondence between start/end LSN and xid in KnownFDWXactList.
+ * This is called during redo of a prepare record to keep a list of the
+ * transactions prepared on foreign servers that have not yet been moved to
+ * files by the end of recovery.
+ */
+void
+KnownFDWXactAdd(XLogReaderState *record)
+{
+	KnownFDWXact *fdw_xact;
+	FDWXactOnDiskData *fdw_xact_data_file = (FDWXactOnDiskData *) XLogRecGetData(record);
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = (KnownFDWXact *) palloc(sizeof(KnownFDWXact));
+	fdw_xact->local_xid = fdw_xact_data_file->local_xid;
+	fdw_xact->serverid = fdw_xact_data_file->serverid;
+	fdw_xact->userid = fdw_xact_data_file->userid;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+
+	dlist_push_tail(&KnownFDWXactList, &fdw_xact->list_node);
+}
+
+/*
+ * KnownFDWXactRemove
+ *
+ * Forget about a foreign transaction. Called during commit/abort redo.
+ */
+void
+KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	dlist_mutable_iter miter;
+
+	Assert(RecoveryInProgress());
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact *fdw_xact = dlist_container(KnownFDWXact, list_node,
+												 miter.cur);
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			dlist_delete(miter.cur);
+			/*
+			 * Since we found the entry in KnownFDWXactList, we know that the
+			 * file isn't on disk yet, so we can stop here.
+			 */
+			return;
+		}
+	}
+
+	/*
+	 * Here we know that the file should be removed from disk. But aborting
+	 * recovery because an unneeded file is absent doesn't seem to be a good
+	 * idea, so call remove with giveWarning = false.
+	 */
+	RemoveFDWXactFile(xid, serverid, userid, false);
+}
+
+/*
+ * KnownFDWXactRecreateFiles
+ *
+ * Moves foreign server transaction records from WAL to files. Called during
+ * checkpoint replay or from PrescanFDWXacts.
+ *
+ * redo_horizon = InvalidXLogRecPtr indicates that all transactions from
+ *		KnownFDWXactList should be moved to disk.
+ */
+void
+KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon)
+{
+	dlist_mutable_iter miter;
+	int			serialized_fdw_xacts = 0;
+
+	Assert(RecoveryInProgress());
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact   *fdw_xact = dlist_container(KnownFDWXact,
+														list_node, miter.cur);
+
+		if (fdw_xact->fdw_xact_end_lsn <= redo_horizon || redo_horizon == InvalidXLogRecPtr)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			pfree(buf);
+			dlist_delete(miter.cur);
+			serialized_fdw_xacts++;
+		}
+	}
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..c10a027 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 0a8edb9..aa4c17d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -58,6 +58,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0f057e4..438c1f0 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1981,6 +1982,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2139,6 +2143,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
 	AtCommit_ApplyLauncher();
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2228,6 +2233,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2616,6 +2624,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4309,6 +4318,10 @@ AbortOutOfAnyTransaction(void)
 void
 RegisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is no need to remember */
+	if (max_fdw_xacts == 0)
+		return;
+
 	XactWriteLocalNode = true;
 }
 
@@ -4318,6 +4331,10 @@ RegisterTransactionLocalNode(void)
 void
 UnregisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is no need to remember */
+	if (max_fdw_xacts == 0)
+		return;
+
 	XactWriteLocalNode = false;
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8973583..bb8779c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5055,6 +5056,7 @@ BootStrapXLOG(void)
 	ControlFile->wal_log_hints = wal_log_hints;
 	ControlFile->track_commit_timestamp = track_commit_timestamp;
 	ControlFile->data_checksum_version = bootstrap_data_checksum_version;
+	ControlFile->max_fdw_xacts = max_fdw_xacts;
 
 	/* some additional ControlFile fields are set in WriteControlFile() */
 
@@ -6122,6 +6124,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_fdw_xacts,
+									 ControlFile->max_fdw_xacts);
 	}
 }
 
@@ -6815,7 +6820,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7440,6 +7448,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7626,6 +7635,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert it into shared memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8933,6 +8945,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9370,7 +9387,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_fdw_xacts != ControlFile->max_fdw_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9391,6 +9409,7 @@ XLogReportParameters(void)
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
 			xlrec.track_commit_timestamp = track_commit_timestamp;
+			xlrec.max_fdw_xacts = max_fdw_xacts;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
@@ -9406,6 +9425,7 @@ XLogReportParameters(void)
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
 		ControlFile->track_commit_timestamp = track_commit_timestamp;
+		ControlFile->max_fdw_xacts = max_fdw_xacts;
 		UpdateControlFile();
 	}
 }
@@ -9594,6 +9614,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9636,6 +9657,7 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
+		KnownFDWXactRecreateFiles(checkPoint.redo);
 		RecoveryRestartPoint(&checkPoint);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
@@ -9786,6 +9808,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
+		ControlFile->max_fdw_xacts = xlrec.max_fdw_xacts;
 
 		/*
 		 * Update minRecoveryPoint to ensure that if recovery is aborted, we
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6511c60..15cad78 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index ba980de..c2f4b46 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index d5d40e6..2981925 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1080,6 +1081,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1375,6 +1390,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index c95ca5b..57cba91 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepLauncherLock				44
 LogicalRepWorkerLock				45
+FDWXactLock					46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0707f66..eb858af 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2055,6 +2056,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_fdw_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 157d775..8770529 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 1ed0d20..3624044 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -204,6 +204,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index f47171d..b7f7dd3 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -302,5 +302,7 @@ main(int argc, char *argv[])
 		   (ControlFile->float8ByVal ? _("by value") : _("by reference")));
 	printf(_("Data page checksum version:           %u\n"),
 		   ControlFile->data_checksum_version);
+	printf(_("Current max_fdw_xacts setting:        %d\n"),
+		   ControlFile->max_fdw_xacts);
 	return 0;
 }
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 27bd9b0..94aa7a8 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -585,6 +585,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -797,6 +798,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_fdw_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..41eed51 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -8,6 +8,7 @@
 #define FRONTEND 1
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..d326ac1
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_fdw_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void KnownFDWXactAdd(XLogReaderState *record);
+extern void KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index b892aea..93edbb5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 586f340..ddb6b5f 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 578bff5..71526ce 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e4194b9..984827b 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -180,6 +180,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_fdw_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ec4aedb..7c4c8a9 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5275,6 +5275,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4111 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..fdb7b19 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -220,6 +239,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5f38fa6..e5f9d73 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -256,11 +256,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1435a7b..843c629 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -121,4 +121,8 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index c661f1d..b0d27ec 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index d4d00d9..a1086d4 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2273,7 +2275,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

002_pgfdw_support_atomic_commit_v9.patch (application/octet-stream)
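
The postgres_fdw patch below resolves a prepared foreign transaction by sending COMMIT PREPARED or ROLLBACK PREPARED with the stored identifier, and it treats a failure with SQLSTATE undefined_object as "already resolved". The following is only a minimal standalone sketch of that decision logic, without libpq. The helper names build_resolve_command and is_resolved are hypothetical and do not appear in the patch; 42704 is the SQLSTATE code for undefined_object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the resolution command for a prepared
 * transaction id, using the same "%s PREPARED '%.*s'" format as the patch.
 * Returns what snprintf returns (the length the full command needs).
 */
static int
build_resolve_command(char *buf, size_t buflen, bool is_commit,
					  int id_len, const char *id)
{
	return snprintf(buf, buflen, "%s PREPARED '%.*s'",
					is_commit ? "COMMIT" : "ROLLBACK", id_len, id);
}

/*
 * Hypothetical helper: decide whether the foreign transaction counts as
 * resolved. command_ok says whether the command succeeded; sqlstate is the
 * five-character code reported on failure, or NULL if none was available.
 */
static bool
is_resolved(bool command_ok, const char *sqlstate)
{
	if (command_ok)
		return true;

	/*
	 * undefined_object (42704): the prepared transaction is missing on the
	 * foreign server, so it was probably resolved by some other means.
	 */
	return sqlstate != NULL && strcmp(sqlstate, "42704") == 0;
}
```

Any other failure, including a lost connection with no SQLSTATE at all, leaves the transaction unresolved so that a later attempt can retry it.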

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pg_fdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..14ab99e 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok, if true, indicates that the caller can handle a
+ * connection error by itself. If false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -226,11 +253,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
 			ereport(ERROR,
 			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 				errmsg("could not connect to server \"%s\"",
 					   server->servername),
 				errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -360,15 +401,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the foreign server with the transaction manager, noting
+		 * whether it is configured for two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -576,158 +624,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * This function crafts a prepared transaction identifier. The PostgreSQL
+ * documentation places two restrictions on the identifier:
+ * 1. It must be a string literal, less than 200 bytes long.
+ * 2. It must not match any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and the UUID extension, which provides the function to generate UUIDs,
+ * is not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%u_%u",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The length excludes the terminating null byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * This function prepares the transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * This function commits or aborts a prepared transaction on the foreign server.
+ * It can be called even when we have no open connection to the foreign server
+ * involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason for the failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -826,3 +1000,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..8c52a11 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7053,3 +7081,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index e24db56..c048c0d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..9f203ad 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "commands/defrem.h"
@@ -465,6 +467,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1319,7 +1327,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1696,7 +1704,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2291,7 +2299,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2553,7 +2561,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3490,7 +3498,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3580,7 +3588,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3803,7 +3811,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..ff57e98 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..d52e0a9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1683,3 +1713,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index b31f373..21abe78 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -426,6 +426,42 @@
     foreign tables, see <xref linkend="sql-createforeigntable">.
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted
+    independently, so some of them may fail to commit while the others
+    commit successfully. This behavior can be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> uses
+       two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
  </sect2>
 
  <sect2>
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
index 7cc491e..5c35bd1 100644
--- a/src/backend/access/rmgrdesc/fdw_xactdesc.c
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -5,7 +5,7 @@
  *
  * This module describes the WAL records for foreign transaction manager.
  *
- * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
  *
  * src/backend/access/transam/fdw_xactdesc.c
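
For readers following the thread: the patch above wires PrepareForeignTransaction and ResolvePreparedForeignTransaction callbacks into postgres_fdw so that a commit is first prepared on every foreign server and only then resolved. The following is a minimal standalone sketch of that generic two-phase commit pattern, not code from the patch. The `Participant` type, `prepare_participant()` and `two_phase_commit()` are hypothetical names invented for illustration, and a boolean flag stands in for whether PREPARE TRANSACTION would succeed on a given server. Servers that do not support 2PC, connection loss, and crash recovery are deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server transaction state; not a postgres_fdw structure. */
typedef enum
{
    TX_ACTIVE,
    TX_PREPARED,
    TX_COMMITTED,
    TX_ABORTED
} TxState;

typedef struct
{
    const char *name;        /* illustrative server name */
    bool        can_prepare; /* simulates whether the prepare step succeeds */
    TxState     state;
} Participant;

/* Phase 1 for a single participant: try to prepare its transaction. */
static bool
prepare_participant(Participant *p)
{
    if (!p->can_prepare)
        return false;
    p->state = TX_PREPARED;
    return true;
}

/*
 * Run both phases over all participants.
 * Returns true if every participant committed, false if all were aborted.
 */
bool
two_phase_commit(Participant *parts, size_t n)
{
    bool    all_prepared = true;
    size_t  i;

    assert(parts != NULL || n == 0);

    /* Phase 1: stop at the first participant that fails to prepare. */
    for (i = 0; i < n && all_prepared; i++)
        all_prepared = prepare_participant(&parts[i]);

    /* Phase 2: commit everywhere only if every prepare succeeded. */
    for (i = 0; i < n; i++)
        parts[i].state = all_prepared ? TX_COMMITTED : TX_ABORTED;

    return all_prepared;
}
```

The point of the sketch is only the invariant: after `two_phase_commit()` returns, either every participant is committed or every participant is aborted, which is the property the unpatched postgres_fdw cannot guarantee across several servers.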

Attachment: 003_fdw_transaction_resolver_contrib_v9.patch (application/octet-stream)

diff --git a/contrib/fdw_transaction_resovler/Makefile b/contrib/fdw_transaction_resovler/Makefile
new file mode 100644
index 0000000..0d2e0e9
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/Makefile
@@ -0,0 +1,15 @@
+# contrib/fdw_transaction_resolver/Makefile
+
+MODULES = fdw_transaction_resolver
+PGFILEDESC = "fdw_transaction_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/fdw_transaction_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/fdw_transaction_resovler/TAGS b/contrib/fdw_transaction_resovler/TAGS
new file mode 120000
index 0000000..cf64c85
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/TAGS
@@ -0,0 +1 @@
+/home/masahiko/pgsql/source/postgresql/TAGS
\ No newline at end of file
diff --git a/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
new file mode 100644
index 0000000..a6eb2c3
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
@@ -0,0 +1,453 @@
+/* -------------------------------------------------------------------------
+ *
+ * fdw_transaction_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/fdw_transaction_resolver/fdw_transaction_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("fdw_transaction_resolver.naptime",
+							"Time to sleep between fdw_transaction_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;	/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+static void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	*dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("fdw_transaction_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		int naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time; TimestampTz differences are in microseconds */
+		naptime_msec = (fx_resolver_naptime * 1000L) - (current_time - launched_time) / 1000;
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid dbid;
+
+			/* Get the database list if empty */
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+														   fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", (unsigned long) SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List *dblist = NIL;
+	ListCell *cell;
+	ListCell *next;
+	ListCell *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple tup;
+	Relation rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Check whether each database has a foreign transaction entry. Delete
+	 * the database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid dbid = lfirst_oid(cell);
+		bool exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index 03e5889..2539cb9 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -115,6 +115,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
+ &fdw-transaction-resolver;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index e7aa92f..5d8200f 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -120,6 +120,7 @@
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">
 <!ENTITY dummy-seclabel  SYSTEM "dummy-seclabel.sgml">
 <!ENTITY earthdistance   SYSTEM "earthdistance.sgml">
+<!ENTITY fdw-transaction-resolver SYSTEM "fdw-transaction-resolver.sgml">
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">
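
A note on the launcher's sleep computation in the resolver above: `fdw_transaction_resolver.naptime` is configured in seconds, WaitLatch() takes its timeout in milliseconds, and TimestampTz values are microseconds with integer datetimes, so three units meet in one expression. The following is a minimal standalone sketch of how the remaining nap time could be computed with explicit conversions. It does not use PostgreSQL headers: `remaining_nap_msec()` and the two unit macros are hypothetical names, and plain `int64_t` stands in for TimestampTz.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_USECS_PER_MSEC 1000  /* microseconds in one millisecond */
#define SKETCH_MSECS_PER_SEC  1000  /* milliseconds in one second */

/*
 * Return how many milliseconds are left to sleep before the next launch.
 *
 * now_usec and last_launch_usec are timestamps in microseconds;
 * naptime_sec is the configured nap time in seconds.
 * The result is clamped at zero once the nap time has elapsed.
 */
int64_t
remaining_nap_msec(int64_t now_usec, int64_t last_launch_usec, int naptime_sec)
{
    int64_t elapsed_msec;
    int64_t remaining;

    assert(naptime_sec >= 0);

    /* Convert the elapsed interval from microseconds to milliseconds. */
    elapsed_msec = (now_usec - last_launch_usec) / SKETCH_USECS_PER_MSEC;

    /* Convert the nap time from seconds to milliseconds, then subtract. */
    remaining = (int64_t) naptime_sec * SKETCH_MSECS_PER_SEC - elapsed_msec;

    return remaining > 0 ? remaining : 0;
}
```

With a 60-second nap time and 1.5 seconds elapsed since the last launch, the function returns 58500. Keeping every intermediate value in milliseconds before handing it to the wait call avoids silently mixing microseconds into the timeout.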

Attachment: 004_regression_test_for_fdw_xact_v9.patch (application/octet-stream)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..b3413ce 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -19,4 +19,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/009_fdw_xact.pl b/src/test/recovery/t/009_fdw_xact.pl
new file mode 100644
index 0000000..79711bc
--- /dev/null
+++ b/src/test/recovery/t/009_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check that we can commit and roll back transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check that we can commit and roll back transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit a transaction involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of the fdw_xact entry exists only in WAL.
+# Check that the fdw_xact entry is processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after the foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be resolved properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index a1086d4..49ff566 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2256,9 +2256,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts. We also set max_fdw_transctions to enable testing of atomic
-		 * foreign transactions. (Note: to reduce the probability of unexpected
-		 * shmmax failures, don't set max_prepared_transactions or
+		 * xacts. We also set max_prepared_foreign_transactions to enable testing
+		 * of atomic foreign transactions. (Note: to reduce the probability of
+		 * unexpected shmmax failures, don't set max_prepared_transactions or
 		 * max_prepared_foreign_transactions any higher than actually needed by the
 		 * corresponding regression tests.).
 		 */
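
The recovery scenarios in 009_fdw_xact.pl above all exercise the same two-phase flow: prepare on every foreign server, then commit or roll back the prepared transactions, with an unreachable server leaving a dangling prepared transaction to be resolved later. The following is only a minimal, self-contained sketch of that flow in C. The `Participant` type, the `PartState` values and `two_phase_commit()` are hypothetical names made up for illustration; none of them come from the patch, and real failures are replaced by two simulated flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical participant states; not an enum from the patch. */
typedef enum
{
	PART_IDLE,
	PART_PREPARED,
	PART_COMMITTED,
	PART_ABORTED
} PartState;

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct
{
	bool		prepare_ok;		/* simulated outcome of the prepare step */
	bool		reachable;		/* simulated reachability during phase two */
	PartState	state;			/* must start as PART_IDLE */
} Participant;

/*
 * Run both phases over n participants.
 *
 * Returns true if the coordinator decided to commit. *unresolved receives
 * the number of participants left in the prepared state because they were
 * unreachable in phase two; those would need a later retry.
 */
static bool
two_phase_commit(Participant *p, size_t n, size_t *unresolved)
{
	bool		commit = true;
	size_t		i;

	*unresolved = 0;

	/* Phase one: prepare everywhere, stopping at the first failure. */
	for (i = 0; i < n; i++)
	{
		if (!p[i].prepare_ok)
		{
			commit = false;
			break;
		}
		p[i].state = PART_PREPARED;
	}

	/* Phase two: commit or roll back whatever was prepared. */
	for (i = 0; i < n; i++)
	{
		if (p[i].state != PART_PREPARED)
		{
			/* Only reached when phase one failed: plain abort. */
			p[i].state = PART_ABORTED;
			continue;
		}
		if (!p[i].reachable)
		{
			/* Stays prepared: a dangling transaction to resolve later. */
			(*unresolved)++;
			continue;
		}
		p[i].state = commit ? PART_COMMITTED : PART_ABORTED;
	}

	return commit;
}
```

A caller would fill an array of `Participant` entries with `state` set to `PART_IDLE`, call `two_phase_commit()`, and treat a nonzero `*unresolved` as the case the `pg_fdw_xact_resolve()` test covers.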

#126

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#125)

5 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Tue, Mar 7, 2017 at 5:04 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Mar 3, 2017 at 1:24 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
On Thu, Mar 2, 2017 at 11:56 AM, vinayak
<Pokale_Vinayak_q3@lab.ntt.co.jp> wrote:

On 2017/02/28 16:54, Masahiko Sawada wrote:

I've created a wiki page[1] describing the design and
functionality of this feature. It also has some examples of use cases,
so the page should be helpful even for testing. Please refer to it if
you're interested in testing this feature.

[1] 2PC on FDW
<https://wiki.postgresql.org/wiki/2PC_on_FDW>

Thank you for creating the wiki page.

Thank you for looking at this patch.
In the "src/test/regress/pg_regress.c" file
-                * xacts.  (Note: to reduce the probability of unexpected shmmax
-                * failures, don't set max_prepared_transactions any higher than
-                * actually needed by the prepared_xacts regression test.)
+                * xacts. We also set max_fdw_transctions to enable testing of atomic
+                * foreign transactions. (Note: to reduce the probability of unexpected
+                * shmmax failures, don't set max_prepared_transactions or
+                * max_prepared_foreign_transactions any higher than actually needed by the
+                * corresponding regression tests.).
I think we are not setting the "max_fdw_transctions" anywhere.
Is this correct?

This comment is out of date. Will fix.

In the "src/bin/pg_waldump/rmgrdesc.c" file the following header file is
included twice.
+ #include "access/fdw_xact.h"
I think we need to remove one line.

Not necessary. Will get rid of it.

Since these are not bugs in the feature itself, I will incorporate them when
I post the updated patches.

Attached is an updated set of patches.
The differences from the previous patches are:
* Fixed a few bugs.
* Split the previous 000 patch into two patches.
* Renamed the pg_fdw_xact_resolver contrib module to
fdw_transaction_resolver.
* Incorporated review comments from Vinayak.

Please review these patches.

Since the previous v9 patches conflict with current HEAD, I have attached the latest patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_register_local_write_v10.patch (application/octet-stream)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 82f9a3c..0f057e4 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -115,6 +115,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -2158,6 +2160,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2429,6 +2433,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2613,6 +2619,8 @@ AbortTransaction(void)
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4296,6 +4304,24 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e1589..0122d63 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -436,6 +436,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -697,6 +700,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -994,6 +1000,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..586f340 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -356,6 +356,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
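
The 000 patch above only records whether the transaction wrote on the local node. How that flag might feed into the choice of commit protocol can be sketched in isolation as below. This is a hedged illustration, not code from the patch: `wrote_local_node` stands in for `XactWriteLocalNode`, and `needs_two_phase()` is a hypothetical helper. The rule it encodes is an assumption: the fdwhandler.sgml text in the 001 patch is read here as "two or more two-phase-capable foreign servers, or one such server plus a local write".

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for XactWriteLocalNode from the patch. */
static bool wrote_local_node = false;

/* Mirrors RegisterTransactionLocalNode(): remember the local write. */
static void
register_local_write(void)
{
	wrote_local_node = true;
}

/* Mirrors UnregisterTransactionLocalNode(): reset at transaction end. */
static void
unregister_local_write(void)
{
	wrote_local_node = false;
}

/*
 * Hypothetical helper, not part of the patch: decide whether two-phase
 * commit is needed, given how many registered foreign servers support it.
 */
static bool
needs_two_phase(int n_two_phase_servers)
{
	if (n_two_phase_servers >= 2)
		return true;

	/* A single foreign server only needs 2PC if local data changed too. */
	return n_two_phase_servers == 1 && wrote_local_node;
}
```

In this sketch the flag is reset at commit, prepare and abort alike, matching the three places where the patch calls `UnregisterTransactionLocalNode()`.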

001_support_fdw_xact_v10.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 69844e5..e2064ba 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1432,6 +1432,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index dbeaab5..639e38b 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1714,5 +1714,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts
+    a transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it must register the foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to it,
+    using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    Here ft1 and ft2 are foreign tables on different foreign servers, which may use
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is greater than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used when two or more
+    foreign servers that support two-phase commit are involved in the
+    transaction, or when the transaction involves one such foreign server and
+    also changes local data. In other cases, for example when only one foreign
+    server that supports two-phase commit is involved and no local data is
+    changed, the two-phase commit protocol is not used.
+    The two-phase commit protocol is processed in two phases: the prepare phase
+    and the commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed to prepare
+    transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>PrepareForeignTransaction</> with the same identifier
+    to prepare the transaction on that foreign server.
+    </para>
+    
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_fdw_xact_resolve</>. One may set up a background worker
+    process to retry the operation by installing the extension fdw_transaction_resolver
+    and including $libdir/fdw_transaction_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit cannot
+    be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..7cc491e
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,63 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+		/* TODO: we also need to assess whether we want to add this information */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+							fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec	*fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch(info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..ff3064e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..48f076a
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2200 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and to the logs to resolve such transactions later.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData	*FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid							serverid;
+	Oid							userid;
+	Oid							umid;
+	char						*servername;
+	FDWXact						fdw_xact;	/* foreign prepared transaction entry
+											   in case prepared */
+	bool						two_phase_commit;	/* Should use two phase commit
+													 * protocol while committing
+													 * transaction on this
+													 * server, whenever
+													 * necessary.
+													 */
+	GetPrepareId_function		get_prepare_id;
+	EndForeignTransaction_function	end_foreign_xact;
+	PrepareForeignTransaction_function	prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function	resolve_prepared_foreign_xact;
+} FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	*MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool	TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection	*fdw_conn;
+	ListCell		*lcell;
+	ForeignServer	*foreign_server;
+	ForeignDataWrapper	*fdw;
+	UserMapping		*user_mapping;
+	FdwRoutine		*fdw_routine;
+	MemoryContext	old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+					(errmsg("attempt to start transaction again on server %u user %u",
+							serverid, userid)));
+	}
+
+	/* This list and its contents need to be saved in the transaction memory context */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+			elog(ERROR, "no function to end a foreign transaction provided for FDW %s",
+					fdw->fdwname);
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			elog(ERROR, "no prepared transaction identifier providing function for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			elog(ERROR, "no function provided for preparing foreign transaction for FDW %s",
+					fdw->fdwname);
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			elog(ERROR, "no function provided for resolving prepared foreign transaction for FDW %s",
+					fdw->fdwname);
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it before hand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be maximum 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN	256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,	/* foreign prepared transaction is to be committed */
+	FDW_XACT_ABORTING_PREPARED,	/* foreign prepared transaction is to be aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								   It doesn't appear in the in-memory entry. */
+} FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact			fx_next;	/* Next free FDWXact entry */
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;	/* XID of local transaction */
+	Oid				serverid;	/* foreign server where transaction takes place */
+	Oid				userid;		/* user who initiated the foreign transaction */
+	Oid				umid;		/* user mapping id for connection key */
+	FDWXactStatus	fdw_xact_status;	/* The state of the foreign transaction.
+										   This doubles as the action to be
+										   taken on this entry.*/
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior
+	 * to commit.
+	 */
+	XLogRecPtr		fdw_xact_start_lsn;   /* XLOG offset of inserting this entry start */
+	XLogRecPtr		fdw_xact_end_lsn;   /* XLOG offset of inserting this entry end*/
+
+	bool			fdw_xact_valid;		/* Has the entry been complete and written to file? */
+	BackendId		locking_backend;	/* Backend working on this entry */
+	bool            ondisk;             /* TRUE if prepare state file is on disk */
+	int				fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char			fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction identifier */
+} FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is the xid, the foreign server
+ * oid and the user oid, each as 8 hexadecimal digits, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			num_fdw_xacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];	/* Variable length array */
+} FDWXactGlobalData;
+
+/*
+ * During replay and replication KnownFDWXactList holds info about active foreign server
+ * transactions that haven't been moved to files yet. We will need that info at the end of
+ * recovery (including promotion) to restore the in-memory state of those transactions.
+ *
+ * A naive approach would be to move each PREPARE record to disk, fsync it, and not keep
+ * this list at all, but that provokes a lot of unnecessary fsyncs on small files,
+ * making the replica slower than the master.
+ *
+ * Replay of twophase records follows these rules:
+ *		* On PREPARE redo, KnownFDWXactAdd() is called to add the transaction to
+ *		  KnownFDWXactList and no further action is taken.
+ *		* On checkpoint redo, we iterate through KnownFDWXactList, move all prepare
+ *		  records that are behind redo_horizon to files, and delete them from the list.
+ *		* On COMMIT/ABORT, we delete the file or the entry in KnownFDWXactList.
+ *		* At the end of recovery, we move all known foreign server transactions to disk
+ *		  to allow RecoverPreparedTransactions/StandbyRecoverPreparedTransactions
+ *		  to do their work.
+ */
+typedef struct KnownFDWXact
+{
+	TransactionId	local_xid;
+	Oid				serverid;
+	Oid				userid;
+	XLogRecPtr		fdw_xact_start_lsn;
+	XLogRecPtr		fdw_xact_end_lsn;
+	dlist_node		list_node;
+} KnownFDWXact;
+
+static dlist_head KnownFDWXactList = DLIST_STATIC_INIT(KnownFDWXactList);
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+							ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id,
+							   FDWXactStatus fdw_xact_status);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int GetFDWXactList(FDWXact *fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+											Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+								void  *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+						List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * GUC variable; changing it requires a restart.
+ */
+int	max_prepared_foreign_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData	*FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	*MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact	fdw_xacts;
+		int		cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->num_fdw_xacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Foreign transactions on servers that cannot execute the two-phase-commit
+ * protocol are committed right away; for these servers it is possible that
+ * some transactions commit even if the local transaction later aborts.
+ * Transactions on servers that can execute the two-phase-commit protocol are
+ * normally prepared here, and are committed or aborted after the local
+ * transaction has been committed or aborted, respectively. However, when only
+ * one such server is involved in the transaction and no changes were made to
+ * local data, the two-phase-commit protocol is unnecessary, so we simply try
+ * to commit the transaction on that server.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell	*cur;
+	ListCell	*prev;
+	ListCell	*next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not execute
+	 * two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+								fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute the
+	 * two-phase-commit protocol have already committed, and MyFDWConnections
+	 * contains only foreign servers that can execute it. We don't need the
+	 * two-phase-commit protocol if there is only one such foreign server and
+	 * the transaction didn't write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * Only one server remains and nothing was written locally, so we don't
+		 * need the two-phase-commit protocol even though this server supports it.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Forget the only remaining connection; "cur" is NULL after the loop above */
+		MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell	*lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection	*fdw_conn = (FDWConnection *)lfirst(lcell);
+		char			*fdw_xact_id;
+		int				fdw_xact_id_len;
+		FDWXact			fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+												 fdw_conn->userid,
+												 &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to prepare
+		 * it on the foreign server. Registration persists this information to
+		 * the disk and logs it (thereby relaying it to the standby). Thus in case
+		 * we lose connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have a prepared transaction on the foreign server and
+		 * try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing the
+		 * transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long as
+		 * the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before persisting
+		 * the information to the disk and crash in-between these two steps, we
+		 * will forget that we prepared the transaction on the foreign server
+		 * and will not be able to resolve it after the crash. Hence persist
+		 * first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+		/*
+		 * Between the register_fdw_xact call above and the time this backend
+		 * hears back from the foreign server, the backend may abort the local
+		 * transaction (say, because of a signal). During abort processing, it
+		 * will send an ABORT message to the foreign server. If the foreign server
+		 * has not prepared the transaction, the message will succeed. If it has
+		 * prepared the transaction, it will throw an error, which we will ignore,
+		 * and the prepared foreign transaction will be resolved by the foreign
+		 * transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction. Delete the
+			 * entry from foreign transaction table. Raise an error, so that the
+			 * local server knows that one of the foreign servers has failed to
+			 * prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then we
+			 * raise actual error here. But instead, we should pull the error
+			 * text from FDW and add it here in the message or as a context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell
+			 * pointer, but that is fine since we will not use that pointer
+			 * because the subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; it is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid,	int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact				fdw_xact;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	int					data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id,
+							   FDW_XACT_PREPARING);
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+					fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *)fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->fdw_xact_valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. The inserted entry
+ * is returned locked.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id, FDWXactStatus fdw_xact_status)
+{
+	FDWXact			fdw_xact;
+	int				cnt;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+				fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = NULL;
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+						xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->num_fdw_xacts < max_prepared_foreign_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts++] = fdw_xact;
+
+	/* Stamp the entry with backend id before releasing the LWLock */
+	fdw_xact->locking_backend = MyBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_status = fdw_xact_status;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			FdwRemoveXlogRec	fdw_remove_xlog;
+			XLogRecPtr			recptr;
+
+			/* Fill up the log record before releasing the entry */
+			fdw_remove_xlog.serverid = fdw_xact->serverid;
+			fdw_remove_xlog.dbid = fdw_xact->dboid;
+			fdw_remove_xlog.xid = fdw_xact->local_xid;
+			fdw_remove_xlog.userid = fdw_xact->userid;
+
+			/* Remove the entry from active array */
+			FDWXactGlobal->num_fdw_xacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->num_fdw_xacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			START_CRIT_SECTION();
+
+			/*
+			 * Log that we are removing the foreign transaction entry and remove
+			 * the file from the disk as well.
+			 */
+			XLogBeginInsert();
+			XLogRegisterData((char *)&fdw_remove_xlog, sizeof(fdw_remove_xlog));
+			recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+			XLogFlush(recptr);
+
+			END_CRIT_SECTION();
+
+			/* Remove the file from the disk if it exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_remove_xlog.xid, fdw_remove_xlog.serverid,
+								  fdw_remove_xlog.userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse the
+	 * order and process exits in-between those two, we will be left with an
+	 * entry locked by this backend, which gets unlocked only at server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact	fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell	*lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact	fdw_xact = fdw_conn->fdw_xact;
+			fdw_xact->fdw_xact_status = (is_commit ?
+											FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED);
+			/* Try aborting or committing the transaction on the foreign server */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server, unlock
+				 * it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be executed
+			 * we have tried to commit the transactions during pre-commit phase.
+			 * Any remaining transactions need to be aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+								is_commit ? "commit" : "abort",
+								fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the entries,
+	 * and may not be able to unlock them if aborted in-between. In any case,
+	 * there is no reason for a foreign transaction entry to be locked after the
+	 * transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared should
+	 * be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	*entries_to_resolve;
+
+	FDWXactStatus	status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+											FDW_XACT_ABORTING_PREPARED;
+	/* Lock and fetch all the entries belonging to the given transaction id. If
+	 * the foreign transaction resolver is running, it might have locked entries
+	 * to check whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point in time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_resolve);
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->fdw_xact_status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+								get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
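+ * get_prepared_foreign_xact_resolver
+ *
+ * Look up the FDW handling the given foreign transaction's server and return
+ * its ResolvePreparedForeignTransaction routine. Raises an error if the FDW
+ * does not provide one.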
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer		*foreign_server;
+	ForeignDataWrapper	*fdw;
+	FdwRoutine			*fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool	resolved;
+	bool	is_commit;
+
+	Assert(fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+			fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED) ?
+							true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry matching the given criteria. The criteria are defined by the arguments
+ * that have valid values for their respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid	  | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid	  | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid	  | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid	  | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid	  | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid	  | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid	  | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid	  | given xid, dbid, serverid, userid
+ *
+ * When the criteria is void (all arguments invalid) the function returns
+ * true if any entry exists at all, since any entry would match the criteria.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry already locked by another backend is
+ * ignored. If all the qualifying entries are locked by other backends, nothing
+ * will be returned in the list but the returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool	entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list. If
+				 * the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+					fdw_xact->locking_backend = MyBackendId;
+
+					/* The list and its members may be required at the end of the transaction */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char    *rec = XLogRecGetData(record);
+	uint8   info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		KnownFDWXactAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec        *fdw_remove_xlog = (FdwRemoveXlogRec *)rec;
+		KnownFDWXactRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						   fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * For simplicity, the function performs all of its disk I/O while holding
+ * FDWXactLock. This blocks new foreign transactions from being prepared
+ * while the files are written, which should rarely matter because few
+ * entries, if any, are expected to live long enough to be written here.
+ * Moving the I/O out of the lock is possible, but it is left as a future
+ * optimisation; see the comment in the function body.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int cnt;
+	int serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick check, before we allocate memory */
+	if (FDWXactGlobal->num_fdw_xacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked invalid,
+	 * because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->num_fdw_xacts; cnt++)
+	{
+		FDWXact	fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if (fdw_xact->fdw_xact_valid &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char *buf;
+			int len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord			*record;
+	XLogReaderState		*xlogreader;
+	char				*errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read foreign transaction state from xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not recreate foreign transaction state file \"%s\": %m",
+						path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact	fdw_xacts;
+	int		num_xacts;
+	int		cur_xact;
+} WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact *fdw_xacts)
+{
+	int	num_xacts;
+	int	cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->num_fdw_xacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->num_fdw_xacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext	oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus	*status;
+	char			*xact_status;
+	List			*entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send
+		 * out as a result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->num_fdw_xacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact	fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+						fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->fdw_xact_status == FDW_XACT_COMMITTING_PREPARED ||
+					fdw_xact->fdw_xact_status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction, so there is nothing to be done here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, and it is
+				 * not running on any backend. It probably crashed before the
+				 * actual abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->fdw_xact_status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the foreign
+				 * transaction. This can happen when the foreign transaction is
+				 * prepared as part of a local prepared transaction. Just
+				 * continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not successful,
+			 * unlock the entry so that someone else can pick it up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].fdw_xact_status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->fdw_xact_valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->fdw_xact_status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entry/s without
+ * resolving. The function gives a way to forget about such prepared
+ * transaction in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId	xid;
+	Oid				dbid;
+	Oid				serverid;
+	Oid				userid;
+	List			*entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+									DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+									PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact	fdw_xact = linitial(entries_to_remove);
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char				path[MAXPGPATH];
+	int					fd;
+	FDWXactOnDiskData	*fdw_xact_file_data;
+	struct stat			stat;
+	uint32				crc_offset;
+	pg_crc32c			calc_crc;
+	pg_crc32c			file_crc;
+	char				*buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+			(errcode_for_file_access(),
+			 errmsg("could not open FDW transaction state file \"%s\": %m",
+					path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not stat FDW transaction state file \"%s\": %m",
+							path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("FDW transaction state file \"%s\" has an invalid size",
+							path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *)buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not read FDW transaction state file \"%s\": %m",
+							path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+			fdw_xact_file_data->userid != userid ||
+			fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction state file \"%s\"",
+							  path)));
+		/* The file was already closed above; do not close it again. */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId	nextXid = ShmemVariableCache->nextXid;
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	/*
+	 * Move foreign transactions from KnownFDWXactList to files, if any.
+	 * It is possible to skip that step and teach subsequent code about
+	 * KnownFDWXactList, but the whole prescan happens only once, at the end
+	 * of recovery or on promotion, so it probably isn't worth the complication.
+	 */
+	KnownFDWXactRecreateFiles(InvalidXLogRecPtr);
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding
+			 * to an XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+					  (errmsg("removing future foreign prepared transaction file \"%s\"",
+							  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+/*
+ * RecoverFDWXactFromFiles
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXactFromFiles(void)
+{
+	DIR				*cldir;
+	struct dirent	*clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+			strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid					serverid;
+			Oid					userid;
+			TransactionId		local_xid;
+			FDWXactOnDiskData	*fdw_xact_file_data;
+			FDWXact				fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+					&userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+						(errmsg("removing corrupt foreign transaction file \"%s\"",
+								 clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+								local_xid, serverid, userid)));
+
+			/*
+			 * Add this entry into the table of foreign transactions. The status
+			 * of the transaction is set as preparing, since we do not know the
+			 * exact status right now. Resolver will set it later based on the
+			 * status of local transaction which prepared this foreign
+			 * transaction.
+			 */
+			fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+									   serverid, userid,
+									   fdw_xact_file_data->umid,
+									   fdw_xact_file_data->fdw_xact_id_len,
+									   fdw_xact_file_data->fdw_xact_id,
+									   FDW_XACT_PREPARING);
+
+			/* No WAL locations are known for an entry restored from a file */
+			fdw_xact->fdw_xact_start_lsn = 0;
+			fdw_xact->fdw_xact_end_lsn = 0;
+			/* Mark the entry as ready */
+			fdw_xact->fdw_xact_valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			/* Unlock the entry as we don't need it any further */
+			unlock_fdw_xact(fdw_xact);
+			pfree(fdw_xact_file_data);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+				   errmsg("could not remove foreign transaction state file \"%s\": %m",
+						  path)));
+}
+
+/*
+ * KnownFDWXactAdd
+ *
+ * Store correspondence of start/end lsn and xid in KnownFDWXactList.
+ * This is called during redo of prepare record to have list of prepared
+ * transactions on foreign server that aren't yet moved to 2PC files by the
+ * end of recovery.
+ */
+void
+KnownFDWXactAdd(XLogReaderState *record)
+{
+	KnownFDWXact *fdw_xact;
+	FDWXactOnDiskData *fdw_xact_data_file = (FDWXactOnDiskData *)XLogRecGetData(record);
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = (KnownFDWXact *) palloc(sizeof(KnownFDWXact));
+	fdw_xact->local_xid = fdw_xact_data_file->local_xid;
+	fdw_xact->serverid = fdw_xact_data_file->serverid;
+	fdw_xact->userid = fdw_xact_data_file->userid;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+
+	dlist_push_tail(&KnownFDWXactList, &fdw_xact->list_node);
+}
+
+/*
+ * KnownFDWXactRemove
+ *
+ * Forget about a foreign transaction. Called during commit/abort redo.
+ */
+void
+KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	dlist_mutable_iter miter;
+
+	Assert(RecoveryInProgress());
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact *fdw_xact = dlist_container(KnownFDWXact, list_node,
+												 miter.cur);
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			dlist_delete(miter.cur);
+			/*
+			 * Since we found the entry in KnownFDWXactList, we know that the
+			 * file isn't on disk yet, so we can stop here.
+			 */
+			return;
+		}
+	}
+
+	/*
+	 * Here we know that the file should be removed from disk. But aborting
+	 * recovery because an unnecessary file is absent doesn't seem to
+	 * be a good idea, so call remove with giveWarning = false.
+	 */
+	RemoveFDWXactFile(xid, serverid, userid, false);
+}
+
+/*
+ * KnownFDWXactRecreateFiles
+ *
+ * Moves foreign server transaction records from WAL to files. Called during
+ * checkpoint replay or from PrescanFDWXacts().
+ *
+ * redo_horizon = InvalidXLogRecPtr indicates that all transactions from
+ *		KnownFDWXactList should be moved to disk.
+ */
+void
+KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon)
+{
+	dlist_mutable_iter miter;
+	int			serialized_fdw_xacts = 0;
+
+	Assert(RecoveryInProgress());
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	dlist_foreach_modify(miter, &KnownFDWXactList)
+	{
+		KnownFDWXact   *fdw_xact = dlist_container(KnownFDWXact,
+														list_node, miter.cur);
+
+		if (fdw_xact->fdw_xact_end_lsn <= redo_horizon || redo_horizon == InvalidXLogRecPtr)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			pfree(buf);
+			dlist_delete(miter.cur);
+			serialized_fdw_xacts++;
+		}
+	}
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+				(errmsg_plural("%u foreign transaction state file was written "
+							   "for long-running prepared transactions",
+							   "%u foreign transaction state files were written "
+							   "for long-running prepared transactions",
+							   serialized_fdw_xacts,
+							   serialized_fdw_xacts)));
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..c10a027 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 0a8edb9..aa4c17d 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -58,6 +58,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1452,6 +1453,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 0f057e4..d37f503 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1981,6 +1982,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2139,6 +2143,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
 	AtCommit_ApplyLauncher();
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2228,6 +2233,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2616,6 +2624,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4309,6 +4318,10 @@ AbortOutOfAnyTransaction(void)
 void
 RegisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is no need to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
 	XactWriteLocalNode = true;
 }
 
@@ -4318,6 +4331,10 @@ RegisterTransactionLocalNode(void)
 void
 UnregisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is no need to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
 	XactWriteLocalNode = false;
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 744360c..a630660 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5074,6 +5075,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6146,6 +6148,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 	}
 }
 
@@ -6839,7 +6844,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7464,6 +7472,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7650,6 +7659,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert it into shared memory. */
+	RecoverFDWXactFromFiles();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8957,6 +8969,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9394,7 +9411,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9411,6 +9429,7 @@ XLogReportParameters(void)
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
+			xlrec.max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
@@ -9426,6 +9445,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9618,6 +9638,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9660,6 +9681,7 @@ xlog_redo(XLogReaderState *record)
 					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
 							checkPoint.ThisTimeLineID, ThisTimeLineID)));
 
+		KnownFDWXactRecreateFiles(checkPoint.redo);
 		RecoveryRestartPoint(&checkPoint);
 	}
 	else if (info == XLOG_CHECKPOINT_ONLINE)
@@ -9807,6 +9829,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6511c60..15cad78 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 0bce209..e1f2771 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index d5d40e6..2981925 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1080,6 +1081,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1375,6 +1390,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in a dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index cd8b08f..148d19d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
+FDWXactLock					45
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 811ea51..d1f089f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2065,6 +2066,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a02b154..27c5342 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index da113bd..2ed329c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -205,6 +205,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..f703e60 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -276,6 +276,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 27bd9b0..e64498f 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -585,6 +585,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -797,6 +798,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..41eed51 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -8,6 +8,7 @@
 #define FRONTEND 1
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..b8a6da2
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid				dboid;		/* database oid where to find foreign server and
+								 * user mapping
+								 */
+	TransactionId	local_xid;
+	Oid				serverid;			/* foreign server where transaction takes place */
+	Oid				userid;				/* user who initiated the foreign transaction */
+	Oid				umid;
+	uint32			fdw_xact_id_len;	/* Length of the value stored in the next field */
+	/* This should always be the last member */
+	char			fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];	/* variable length array
+														 * to store foreign transaction
+														 * information.
+														 */
+} FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId	xid;
+	Oid				serverid;
+	Oid				userid;
+	Oid				dbid;
+} FdwRemoveXlogRec;
+
+extern int	max_prepared_foreign_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXactFromFiles(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+								Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void KnownFDWXactAdd(XLogReaderState *record);
+extern void KnownFDWXactRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index b892aea..93edbb5 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 586f340..ddb6b5f 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 578bff5..d192136 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..c57a66f 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index ec4aedb..7c4c8a9 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5275,6 +5275,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4109 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4110 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4111 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..fdb7b19 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -220,6 +239,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5f38fa6..e5f9d73 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -256,11 +256,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1435a7b..843c629 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -121,4 +121,8 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bd13ae6..697ff81 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index c393ae1..61ce842 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2258,9 +2258,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable testing
+		 * of atomic foreign transactions. (Note: to reduce the probability of
+		 * unexpected shmmax failures, don't set max_prepared_transactions or
+		 * max_prepared_foreign_transactions any higher than actually needed by
+		 * the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2275,7 +2277,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

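The core patch above hooks foreign-transaction handling into CommitTransaction() (PreCommit_FDWXacts() before commit, AtEOXact_FDWXacts() afterwards) and adds FDW callbacks for preparing and resolving remote transactions. As a rough, standalone illustration of the two-phase flow this implies, here is a toy model in plain C. It is not code from the patch and does not use any PostgreSQL API: ToyServer, ToyState, toy_prepare_all(), toy_finish_all() and toy_commit_distributed() are all hypothetical names invented for this sketch, and a real implementation also has to persist state and cope with lost connections, which the sketch ignores.

```c
/* Toy model of two-phase commit across several servers; not PostgreSQL code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server transaction states for this sketch. */
typedef enum
{
	TOY_ACTIVE,
	TOY_PREPARED,
	TOY_COMMITTED,
	TOY_ABORTED
} ToyState;

typedef struct
{
	bool		prepare_ok;		/* assumption: whether PREPARE succeeds here */
	ToyState	state;
} ToyServer;

/* Phase 1: prepare every participant, stopping at the first failure. */
static bool
toy_prepare_all(ToyServer *servers, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (!servers[i].prepare_ok)
			return false;
		servers[i].state = TOY_PREPARED;
	}
	return true;
}

/* Phase 2: resolve every participant the same way. */
static void
toy_finish_all(ToyServer *servers, size_t n, bool is_commit)
{
	for (size_t i = 0; i < n; i++)
	{
		/* A commit decision is only valid once everyone has prepared. */
		assert(!is_commit || servers[i].state == TOY_PREPARED);
		servers[i].state = is_commit ? TOY_COMMITTED : TOY_ABORTED;
	}
}

/* Returns true if all servers committed, false if all were aborted. */
bool
toy_commit_distributed(ToyServer *servers, size_t n)
{
	bool		all_prepared = toy_prepare_all(servers, n);

	toy_finish_all(servers, n, all_prepared);
	return all_prepared;
}
```

Calling toy_commit_distributed() on an array where every entry has prepare_ok set leaves all entries in TOY_COMMITTED. If any entry has prepare_ok cleared, every entry ends in TOY_ABORTED, including those that had already prepared. That all-or-nothing outcome is the property the patch is after.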
002_pgfdw_support_atomic_commit_v10.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..14ab99e 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * If connection_error_ok is true, the caller handles a connection failure
+ * itself and NULL is returned; if false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails and the caller can handle connection failure
+ * (connection_error_ok = true), return NULL; otherwise throw an error.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -226,11 +253,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
 			ereport(ERROR,
 			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 				errmsg("could not connect to server \"%s\"",
 					   server->servername),
 				errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -360,15 +401,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether the two phase
+		 * compliance is possible.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -576,158 +624,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * Craft a prepared transaction identifier. The PostgreSQL documentation
+ * places two restrictions on the identifier:
+ * 1. It must be a string literal, less than 200 bytes long.
+ * 2. It must not match the id of any other concurrent prepared transaction.
+ *
+ * Ideally the prepared transaction id would be something like a UUID, which
+ * is unique with high probability, but generating one may be expensive here,
+ * and the extension that provides the UUID-generating function is not part
+ * of core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* Report the length, excluding the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * Prepare the transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed; raise a warning so that the reason for the
+			 * failure gets logged. Do not raise an error: the caller, i.e. the
+			 * foreign transaction manager, takes the appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * an error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * Commit or abort a prepared transaction on the foreign server.
+ * This function may be called when we have no connection to the foreign
+ * server involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -826,3 +1000,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..8c52a11 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7053,3 +7081,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index e24db56..c048c0d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 990313a..20f8472 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -466,6 +468,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1320,7 +1328,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1697,7 +1705,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2292,7 +2300,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2554,7 +2562,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3491,7 +3499,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3581,7 +3589,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3804,7 +3812,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..ff57e98 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -102,7 +103,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -163,6 +165,14 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..d52e0a9 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1683,3 +1713,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involving local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 7a9b655..8f6ab2c 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -426,6 +426,42 @@
     foreign tables, see <xref linkend="sql-createforeigntable">.
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted
+    independently. Some of these transactions may fail to commit on their
+    remote server while others commit successfully. This behavior may be
+    overridden using the following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> is
+       allowed to use two-phase commit when a transaction commits. This
+       option can only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
  </sect2>
 
  <sect2>
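
The documentation hunk above describes the two_phase_commit option only in prose. As a reading aid, here is a minimal, self-contained sketch of the two-phase protocol it refers to: prepare on every participant first, then commit everywhere only if every prepare succeeded, otherwise abort everywhere. The Participant type, the XactState values, and two_phase_commit() are hypothetical names made up for this illustration; they are not taken from the patch, which drives the real protocol through PREPARE TRANSACTION and COMMIT/ROLLBACK PREPARED on the remote servers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical participant state; not a structure from the patch. */
typedef enum
{
	XACT_ACTIVE,
	XACT_PREPARED,
	XACT_COMMITTED,
	XACT_ABORTED
} XactState;

typedef struct Participant
{
	bool		prepare_ok;		/* simulated outcome of the prepare step */
	XactState	state;
} Participant;

/*
 * Phase 1: prepare on each participant, stopping at the first failure.
 * Phase 2: commit everywhere if all prepares succeeded, else abort everywhere.
 * Returns true if the distributed transaction committed.
 */
static bool
two_phase_commit(Participant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	assert(parts != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		if (!parts[i].prepare_ok)
		{
			all_prepared = false;
			break;
		}
		parts[i].state = XACT_PREPARED;
	}

	for (i = 0; i < n; i++)
		parts[i].state = all_prepared ? XACT_COMMITTED : XACT_ABORTED;

	return all_prepared;
}
```

With three participants of which one fails to prepare, the function returns false and every participant ends in XACT_ABORTED, which is the all-or-nothing outcome the regression tests above check for servers that have two_phase_commit enabled.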

003_fdw_transaction_resolver_contrib_v10.patch (application/octet-stream)

diff --git a/contrib/fdw_transaction_resovler/Makefile b/contrib/fdw_transaction_resovler/Makefile
new file mode 100644
index 0000000..0d2e0e9
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/Makefile
@@ -0,0 +1,15 @@
+# contrib/fdw_transaction_resolver/Makefile
+
+MODULES = fdw_transaction_resolver
+PGFILEDESC = "fdw_transaction_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/fdw_transaction_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/fdw_transaction_resovler/TAGS b/contrib/fdw_transaction_resovler/TAGS
new file mode 120000
index 0000000..cf64c85
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/TAGS
@@ -0,0 +1 @@
+/home/masahiko/pgsql/source/postgresql/TAGS
\ No newline at end of file
diff --git a/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
new file mode 100644
index 0000000..a6eb2c3
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
@@ -0,0 +1,453 @@
+/* -------------------------------------------------------------------------
+ *
+ * fdw_transaction_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/fdw_transaction_resolver/fdw_transaction_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("fdw_transaction_resolver.naptime",
+							"Time to sleep between fdw_transaction_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+	/*
+	 * For some reason unless background worker set
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag even
+	 * if the backend itself doesn't need database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	*dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = launched_time + (fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("fdw_transaction_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);		/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM);	/* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT);	/* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int		rc;
+		int naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time */
+		naptime_msec = (fx_resolver_naptime * 1000L) - (current_time - launched_time);
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will restart
+		 * them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid dbid;
+
+			/* Get the database list if empty*/
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+														   fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	*command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid		dbid = DatumGetObjectId(dbid_datum);
+	int		ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler only
+	 * for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function.
+	 * Note that each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time
+	 * for the statement we're about to run, and also the transaction
+	 * start time.  Also, each other query sent to SPI should probably be
+	 * preceded by SetCurrentStatementStartTimestamp(), so that statement
+	 * start time is always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager,
+	 * and the PushActiveSnapshot() call creates an "active" snapshot
+	 * which is necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible
+	 * through the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done; exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List *dblist = NIL;
+	ListCell *cell;
+	ListCell *next;
+	ListCell *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple tup;
+	Relation rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Check whether each database has a foreign transaction entry. Delete
+	 * the database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid dbid = lfirst_oid(cell);
+		bool exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index eaaa36c..63a33fd 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -116,6 +116,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
+ &fdw-transaction-resolver;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 6782f07..6d28cbd 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -121,6 +121,7 @@
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">
 <!ENTITY dummy-seclabel  SYSTEM "dummy-seclabel.sgml">
 <!ENTITY earthdistance   SYSTEM "earthdistance.sgml">
+<!ENTITY fdw-transaction-resolver SYSTEM "fdw-transaction-resolver.sgml">
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">
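
The launcher loop in fdw_transaction_resolver.c combines a naptime GUC given in seconds, TimestampTz values (which count microseconds when integer datetimes are in use), and a WaitLatch() timeout given in milliseconds. The sketch below is not code from the patch; it only shows the unit bookkeeping in isolation, with a hypothetical timestamp_us typedef standing in for TimestampTz and a hypothetical helper name remaining_nap_ms(), so that reviewers can compare it against the arithmetic in FDWXactResolverMain().

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a microsecond-resolution timestamp; assumed for this sketch. */
typedef int64_t timestamp_us;

/*
 * Milliseconds left to sleep before the next launch, clamped at zero.
 * naptime_sec is in seconds; both timestamps are in microseconds.
 */
static int64_t
remaining_nap_ms(int naptime_sec, timestamp_us launched_at, timestamp_us now)
{
	int64_t		elapsed_ms;
	int64_t		remaining;

	assert(naptime_sec > 0);

	/* Convert the elapsed interval to milliseconds before subtracting. */
	elapsed_ms = (now - launched_at) / 1000;
	remaining = (int64_t) naptime_sec * 1000 - elapsed_ms;

	return remaining > 0 ? remaining : 0;
}
```

For a 60 second naptime, a call made 30 seconds after the last launch yields 30000, and any call made after the naptime has elapsed yields 0, so the caller would not sleep at all.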

004_regression_test_for_fdw_xact_v10.patch (application/octet-stream)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..b3413ce 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -19,4 +19,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/009_fdw_xact.pl b/src/test/recovery/t/009_fdw_xact.pl
new file mode 100644
index 0000000..79711bc
--- /dev/null
+++ b/src/test/recovery/t/009_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check that we can commit and roll back transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check that we can commit and roll back transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit transactions involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact entries exists only in WAL.
+# Check that fdw_xact entries are processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after a foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 61ce842..fa6c1b1 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2258,9 +2258,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts. We also set max_fdw_transctions to enable testing of atomic
-		 * foreign transactions. (Note: to reduce the probability of unexpected
-		 * shmmax failures, don't set max_prepared_transactions or
+		 * xacts. We also set max_prepared_foreign_transactions to enable testing
+		 * of atomic foreign transactions. (Note: to reduce the probability of
+		 * unexpected shmmax failures, don't set max_prepared_transactions or
 		 * max_prepared_foreign_transactions any higher than actually needed by the
 		 * corresponding regression tests.).
 		 */
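
One scenario in 009_fdw_xact.pl stops a foreign server after the transaction has been prepared, lets COMMIT PREPARED fail there, restarts the server, and then expects pg_fdw_xact_resolve() to report one resolved transaction. The sketch below models only that expectation: an entry is resolved on a pass if its server is reachable, and stays dangling otherwise. DanglingXact and resolve_dangling() are hypothetical names for illustration and do not reflect how the patch stores or resolves fdw_xact entries internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one unresolved foreign transaction entry. */
typedef struct DanglingXact
{
	bool		server_up;		/* can the foreign server be reached now? */
	bool		resolved;		/* has the entry already been resolved? */
} DanglingXact;

/*
 * One resolver pass: resolve every pending entry whose server is reachable.
 * Returns the number of entries resolved during this pass.
 */
static int
resolve_dangling(DanglingXact *xacts, size_t n)
{
	int			newly_resolved = 0;
	size_t		i;

	assert(xacts != NULL || n == 0);

	for (i = 0; i < n; i++)
	{
		/* Skip entries that are done or whose server is still down. */
		if (xacts[i].resolved || !xacts[i].server_up)
			continue;

		xacts[i].resolved = true;
		newly_resolved++;
	}

	return newly_resolved;
}
```

A pass made while the server is down resolves nothing, the first pass after the server comes back resolves the entry, and a further pass finds nothing left to do, which mirrors the count of 1 the test checks after restarting node_fs1.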

#127

Vinayak Pokale

pokale_vinayak_q3@lab.ntt.co.jp

almost 9 years ago

In reply to: Masahiko Sawada (#126)

Re: Transactions involving multiple postgres foreign servers

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

I have tested the latest patch and it looks good to me,
so I marked it "Ready for committer".
Anyway, it would be great if anyone could also have a look at the patches and send comments.

The new status of this patch is: Ready for Committer


#128

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Vinayak Pokale (#127)

6 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Thu, Mar 16, 2017 at 2:37 PM, Vinayak Pokale
<pokale_vinayak_q3@lab.ntt.co.jp> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

I have tested the latest patch and it looks good to me,
so I marked it "Ready for committer".
Anyway, it would be great if anyone could also have a look at the patches and send comments.

The new status of this patch is: Ready for Committer

Thank you for updating the status, but I found a bug in the 001 patch. The latest patches are attached.
The differences are:
* Fixed a bug.
* Ran pgindent.
* Separated the patch supporting GetPrepareID API.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

002_pgfdw_support_atomic_commit_v11.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..fe8500d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok, if true, indicates that the caller can handle a
+ * connection error by itself. If false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -226,11 +253,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
 			ereport(ERROR,
 			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 				errmsg("could not connect to server \"%s\"",
 					   server->servername),
 				errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -360,15 +401,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether the two phase
+		 * compliance is possible.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -576,158 +624,265 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * The function prepares transaction on foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  char *prep_info)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or a warning.
+			 * The command failed; raise a warning so that the reason for the
+			 * failure gets logged. Do not raise an error: the caller, i.e. the
+			 * foreign transaction manager, takes the appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * The function commits or aborts a prepared transaction on the foreign server.
+ * This function may be called when we have no connection to the foreign
+ * server involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%s'",
+						 is_commit ? "COMMIT" : "ROLLBACK",
+						 prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -826,3 +981,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 059c5c3..ddeb0b8 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7181,3 +7209,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index e24db56..c048c0d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e8cb2d0..ba6795a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -466,6 +468,11 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1327,7 +1334,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1704,7 +1711,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2299,7 +2306,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2561,7 +2568,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								&retrieved_attrs, NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3498,7 +3505,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3588,7 +3595,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3811,7 +3818,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 57dbb79..f256a92 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -116,7 +117,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -177,6 +179,12 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 8f3edc1..caf0aa2 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1706,3 +1736,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 7a9b655..8f6ab2c 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -426,6 +426,42 @@
     foreign tables, see <xref linkend="sql-createforeigntable">.
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted independently.
+    Some of these transactions may fail to commit on their remote server while
+    others commit successfully. This may be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> is allowed
+       to use two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
  </sect2>
 
  <sect2>

Attachment: 001_support_fdw_xact_v11.patch (application/octet-stream)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b379b67..4a929ae 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1432,6 +1432,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index dbeaab5..639e38b 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1714,5 +1714,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on a foreign server within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper
+    starts a transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it must register that foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to it,
+    by calling <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    the two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may use
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is greater than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used
+    when more than two foreign servers
+    that support two-phase commit are involved in the transaction, or when the
+    transaction involves one foreign server that supports two-phase commit
+    as well as changes to local data. In other cases, for example when only one
+    foreign server that supports two-phase commit is involved in the transaction,
+    the two-phase commit protocol is not used. The two-phase commit protocol is
+    processed in two phases: a prepare phase and a commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any
+    foreign server fails to prepare its transaction, the prepare phase fails. In the
+    commit phase, all the prepared transactions are committed if the prepare phase
+    succeeded, or rolled back if the prepare phase failed to prepare transactions on all
+    the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase, the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier
+    with action FDW_XACT_RESOLVED.
+    </para>
+    
+    <para>
+    During the commit phase, the distributed transaction manager calls
+    <function>ResolveForeignTransaction</> with the same identifier with action
+    FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension fdw_transaction_resolver
+    and including $libdir/fdw_transaction_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit
+    cannot be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers are rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, the transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..869faf7
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,68 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+
+		/*
+		 * TODO: we also need to assess whether we want to add this
+		 * information
+		 */
+		appendStringInfo(buf, " foreign transaction info: %s",
+						 fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..ff3064e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..90d11df
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2182 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign server/s.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the oids of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers, which can participate
+ * in two-phase commit protocol. Transaction on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and to the logs in order to resolve such transactions.
+ *
+ * During replay and replication FDWXactGlobal also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdw_xact records happens by the following rules:
+ *
+ *		* On PREPARE redo we add the foreign transaction to
+ *		  FDWXactGlobal->fdw_xacts. We set fdw_xact->inredo to true for
+ *		  such entries.
+ *
+ *		* On Checkpoint redo we iterate through the FDWXactGlobal->fdw_xacts
+ *		  entries that have fdw_xact->inredo set and are behind the redo_horizon.
+ *		  We save them to disk and also set fdw_xact->ondisk to true.
+ *
+ *		* On COMMIT/ABORT we delete the entry from FDWXactGlobal->fdw_xacts.
+ *		  If fdw_xact->ondisk is true, we delete the corresponding entry from
+ *		  the disk as well.
+ *
+ *		* RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions() and
+ *		  PrescanPreparedTransactions() have been modified to go through
+ *		  fdw_xact->inredo entries that have not made it to disk yet.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData *FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid			serverid;
+	Oid			userid;
+	Oid			umid;
+	char	   *servername;
+	FDWXact		fdw_xact;		/* foreign prepared transaction entry in case
+								 * prepared */
+	bool		two_phase_commit;		/* Should use two phase commit
+										 * protocol while committing
+										 * transaction on this server,
+										 * whenever necessary. */
+	EndForeignTransaction_function end_foreign_xact;
+	PrepareForeignTransaction_function prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
+}	FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	   *MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool		TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection *fdw_conn;
+	ListCell   *lcell;
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	UserMapping *user_mapping;
+	FdwRoutine *fdw_routine;
+	MemoryContext old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+			(errmsg("attempt to start transaction again on server %u user %u",
+					serverid, userid)));
+	}
+
+	/*
+	 * This list and its contents need to be saved in the transaction context
+	 * memory
+	 */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		ereport(ERROR,
+				(errmsg("no function to end a foreign transaction provided for FDW %s",
+						fdw->fdwname)));
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for preparing foreign transaction for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for resolving prepared foreign transaction for FDW %s",
+							fdw->fdwname)));
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,		/* foreign prepared transaction is to
+										 * be committed */
+	FDW_XACT_ABORTING_PREPARED, /* foreign prepared transaction is to be
+								 * aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								 * It doesn't appear in the in-memory entry. */
+}	FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact		fx_next;		/* Next free FDWXact entry */
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;	/* XID of local transaction */
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping id for connection key */
+	FDWXactStatus status;		/* The state of the foreign
+								 * transaction. This doubles as the
+								 * action to be taken on this entry. */
+
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior to
+	 * commit.
+	 */
+	XLogRecPtr	fdw_xact_start_lsn;		/* XLOG offset of inserting this entry
+										 * start */
+	XLogRecPtr	fdw_xact_end_lsn;		/* XLOG offset of inserting this entry
+										 * end */
+
+	bool		valid; /* Has the entry been completed and written to file? */
+	BackendId	locking_backend;	/* Backend working on this entry */
+	bool		ondisk;			/* TRUE if prepare state file is on disk */
+	bool		inredo;			/* TRUE if entry was added via xlog_redo */
+	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction
+														 * identifier */
+}	FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 hex digits of xid, 8 hex
+ * digits of foreign server oid and 8 hex digits of user oid, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+			 serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			numFDWXacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];		/* Variable length array */
+}	FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+  ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, char *fdw_xact_id);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_id);
+static int	GetFDWXactList(FDWXact * fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+				Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+				  bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+static FDWXact get_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * This is a GUC variable; changing it requires a restart.
+ */
+int			max_prepared_foreign_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData *FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	   *MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact		fdw_xacts;
+		int			cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->numFDWXacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Foreign transactions are prepared on the foreign servers that can execute
+ * the two-phase-commit protocol; those will be committed or aborted after the
+ * current transaction has been committed or aborted, respectively. If only one
+ * such server is involved in the transaction and no changes were made to local
+ * data, the two-phase-commit protocol is not needed, so we simply try to
+ * commit the transaction on that server. We also try to commit transactions
+ * on the rest of the foreign servers now. For these foreign servers it is
+ * possible that some transactions commit even if the local transaction
+ * aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell   *cur;
+	ListCell   *prev;
+	ListCell   *next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers that cannot
+	 * execute the two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+					 fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute the
+	 * two-phase-commit protocol have already committed, and MyFDWConnections
+	 * has only foreign servers that can execute it. We don't need the
+	 * two-phase-commit protocol if there is only one such foreign server and
+	 * the transaction didn't write on the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on all the foreign servers that can
+		 * execute the two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * We don't need the two-phase-commit protocol when only one server
+		 * remains and nothing was written locally, even if this server can
+		 * execute the two-phase-commit protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* That was the only connection left, so the list is now empty */
+		MyFDWConnections = list_delete_first(MyFDWConnections);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute the two-phase
+ * commit protocol. The rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell   *lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = (FDWConnection *) lfirst(lcell);
+		char	    fdw_xact_id[FDW_XACT_ID_LEN];
+		FDWXact		fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		/* Generate prepare transaction id for foreign server */
+		FDWXactId(fdw_xact_id, "px", GetTopTransactionId(),
+				  fdw_conn->serverid, fdw_conn->userid);
+
+		/*
+		 * Register the foreign transaction with the identifier used to
+		 * prepare it on the foreign server. Registration persists this
+		 * information to the disk and logs (thereby relaying it to the standby).
+		 * Thus in case we lose connectivity to the foreign server or crash
+		 * ourselves, we will remember that we have prepared transaction on
+		 * the foreign server and try to resolve it when connectivity is
+		 * restored or after crash recovery.
+		 *
+		 * If we crash after persisting the information but before preparing
+		 * the transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long
+		 * as the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before
+		 * persisting the information to the disk and crash in-between these
+		 * two steps, we will forget that we prepared the transaction on the
+		 * foreign server and will not be able to resolve it after the crash.
+		 * Hence persist first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id);
+
+		/*
+		 * Between the register_fdw_xact call above and the moment this backend
+		 * hears back from the foreign server, it may abort the local transaction
+		 * (say, because of a signal). During abort processing, it will send
+		 * an ABORT message to the foreign server. If the foreign server has
+		 * not prepared the transaction, the message will succeed. If the
+		 * foreign server has prepared transaction, it will throw an error,
+		 * which we will ignore and the prepared foreign transaction will be
+		 * resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction.
+			 * Delete the entry from foreign transaction table. Raise an
+			 * error, so that the local server knows that one of the foreign
+			 * servers has failed to prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then
+			 * we raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell pointer,
+			 * but that is fine since we will not use that pointer because the
+			 * subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+				  (errmsg("can not prepare transaction on foreign server %s",
+						  fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+
+/*
+ * register_fdw_xact
+ *
+ * This function creates a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; a checkpoint later persists it to disk under the pg_fdw_xact directory.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_id)
+{
+	FDWXact		fdw_xact;
+	FDWXactOnDiskData *fdw_xact_file_data;
+	int			data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->locking_backend = MyBackendId;
+	LWLockRelease(FDWXactLock);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+		   FDW_XACT_ID_LEN);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *) fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* WAL record is flushed, checkpoint can proceed with syncing */
+	fdw_xact->valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. Caller must hold
+ * FDWXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				char *fdw_xact_id)
+{
+	int i;
+	FDWXact fdw_xact;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	/* Check for a duplicate foreign transaction entry */
+	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[i];
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+				 xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+		errhint("Increase max_prepared_foreign_transactions (currently %d).",
+				max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->numFDWXacts < max_prepared_foreign_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->numFDWXacts++] = fdw_xact;
+
+	/* Initialize the entry; the caller stamps it with its backend id */
+	fdw_xact->locking_backend = InvalidBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->inredo = false;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int			cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			/* Remove the entry from active array */
+			FDWXactGlobal->numFDWXacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->numFDWXacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			if (!RecoveryInProgress())
+			{
+				FdwRemoveXlogRec fdw_remove_xlog;
+				XLogRecPtr	recptr;
+
+				/* Fill up the log record before releasing the entry */
+				fdw_remove_xlog.serverid = fdw_xact->serverid;
+				fdw_remove_xlog.dbid = fdw_xact->dboid;
+				fdw_remove_xlog.xid = fdw_xact->local_xid;
+				fdw_remove_xlog.userid = fdw_xact->userid;
+
+				START_CRIT_SECTION();
+
+				/*
+				 * Log that we are removing the foreign transaction entry and
+				 * remove the file from the disk as well.
+				 */
+				XLogBeginInsert();
+				XLogRegisterData((char *) &fdw_remove_xlog, sizeof(fdw_remove_xlog));
+				recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+				XLogFlush(recptr);
+
+				END_CRIT_SECTION();
+			}
+
+			/* Remove the file from the disk if exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								  fdw_xact->userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transaction.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse
+	 * the order and process exits in-between those two, we will be left with
+	 * an entry locked by this backend, which gets unlocked only at the server
+	 * restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact		fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact		fdw_xact = fdw_conn->fdw_xact;
+
+			fdw_xact->status = (is_commit ?
+										 FDW_XACT_COMMITTING_PREPARED :
+										 FDW_XACT_ABORTING_PREPARED);
+
+			/*
+			 * Try aborting or committing the transaction on the foreign
+			 * server
+			 */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server,
+				 * unlock it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be
+			 * executed we have tried to commit the transactions during
+			 * pre-commit phase. Any remaining transactions need to be
+			 * aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+					 is_commit ? "commit" : "abort",
+					 fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the
+	 * entries, and may not be able to unlock them if aborted in-between. In
+	 * any case, there is no reason for a foreign transaction entry to be
+	 * locked after the transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared
+	 * should be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	   *entries_to_resolve;
+
+	FDWXactStatus status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+	FDW_XACT_ABORTING_PREPARED;
+
+	/*
+	 * Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to check
+	 * whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact		fdw_xact = linitial(entries_to_resolve);
+
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+							  get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				 ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool		resolved;
+	bool		is_commit;
+
+	Assert(fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+		   fdw_xact->status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED) ?
+		true : false;
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * Get foreign transaction entry from FDWXactGlobal->fdw_xacts. Return NULL
+ * if the given foreign transaction does not exist.
+ */
+static FDWXact
+get_fdw_xact(TransactionId xid, Oid serverid, Oid userid)
+{
+	int i;
+	FDWXact fdw_xact;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[i];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			LWLockRelease(FDWXactLock);
+			return fdw_xact;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	return NULL;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is wrapper around search_fdw_xact. Check that
+ * function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria are defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | any entry (no restriction)
+ * invalid | invalid | invalid	| valid   | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid   | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid   | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid   | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid   | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid   | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every entry matches,
+ * so the function returns true whenever at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked is ignored. If
+ * all the qualifying entries are locked, nothing will be returned in the list
+ * but returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		FDWXact		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool		entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list.
+				 * If the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+
+					fdw_xact->locking_backend = MyBackendId;
+
+					/*
+					 * The list and its members may be required at the end of
+					 * the transaction
+					 */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		FDWXactRedoAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		/* Delete FDWXact entry and file if exists */
+		FDWXactRedoRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function writes to disk the foreign transaction entries whose WAL records
+ * precede the checkpoint's redo horizon and which are not yet on disk. The
+ * foreign transaction entries and hence the corresponding files are expected
+ * to be very short-lived. By executing this function at the end, we might
+ * have fewer files to write, thus reducing some I/O. This is similar to
+ * CheckPointTwoPhase().
+ *
+ * For simplicity, the function performs the I/O while holding FDWXactLock.
+ * This blocks backends that need the lock in exclusive mode, such as those
+ * registering new foreign transactions, for the duration of the writes. That
+ * is expected to be rare, since entries old enough to be written out here
+ * indicate that the transaction manager is not actively resolving them.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int			cnt;
+	int			serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick check, before we do any work */
+	if (FDWXactGlobal->numFDWXacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked
+	 * invalid, because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		FDWXact		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if ((fdw_xact->valid || fdw_xact->inredo) &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+			  (errmsg_plural("%u foreign transaction state file was written "
+							 "for long-running prepared transactions",
+							 "%u foreign transaction state files were written "
+							 "for long-running prepared transactions",
+							 serialized_fdw_xacts,
+							 serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+		   errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not read foreign transaction state from xlog at %X/%X",
+			   (uint32) (lsn >> 32),
+			   (uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not recreate foreign transaction state file \"%s\": %m",
+			   path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact		fdw_xacts;
+	int			num_xacts;
+	int			cur_xact;
+}	WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact * fdw_xacts)
+{
+	int			num_xacts;
+	int			cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->numFDWXacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->numFDWXacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * A user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function returns the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+	List	   *entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->numFDWXacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact		fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+				   fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+				fdw_xact->status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing to be done.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The local transaction is neither committed nor aborted, and
+				 * is not running on any backend. It probably crashed before
+				 * the actual abort or commit, so assume it to be aborted.
+				 */
+				fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the
+				 * foreign transaction. This can happen when the foreign
+				 * transaction is prepared as part of a local prepared
+				 * transaction. Just continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not
+			 * successful, unlock the entry so that someone else can pick it
+			 * up.
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in cases such as:
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId xid;
+	Oid			dbid;
+	Oid			serverid;
+	Oid			userid;
+	List	   *entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+		DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact		fdw_xact = linitial(entries_to_remove);
+
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+	FDWXactOnDiskData *fdw_xact_file_data;
+	struct stat stat;
+	uint32		crc_offset;
+	pg_crc32c	calc_crc;
+	pg_crc32c	file_crc;
+	char	   *buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			   errmsg("could not open FDW transaction state file \"%s\": %m",
+					  path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not stat FDW transaction state file \"%s\": %m",
+					  path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("incorrect size of FDW transaction state file \"%s\"",
+						path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *) buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not read FDW transaction state file \"%s\": %m",
+					  path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+		fdw_xact_file_data->userid != userid ||
+		fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+			(errmsg("removing corrupt foreign transaction state file \"%s\"",
+					path)));
+		/* fd was already closed after the read above; do not close it again */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId nextXid = ShmemVariableCache->nextXid;
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding to an
+			 * XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+						(errmsg("removing future foreign prepared transaction file \"%s\"",
+								clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+
+/*
+ * RecoverFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXacts(void)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+			FDWXactOnDiskData *fdw_xact_file_data;
+			FDWXact		fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction file \"%s\"",
+						  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+							local_xid, serverid, userid)));
+
+			fdw_xact = get_fdw_xact(local_xid, serverid, userid);
+
+			LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+			if (!fdw_xact)
+			{
+				/*
+				 * Add this entry into the table of foreign transactions. The
+				 * status of the transaction is set as preparing, since we do not
+				 * know the exact status right now. Resolver will set it later
+				 * based on the status of local transaction which prepared this
+				 * foreign transaction.
+				 */
+				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										   serverid, userid,
+										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id);
+				fdw_xact->locking_backend = MyBackendId;
+				fdw_xact->status = FDW_XACT_PREPARING;
+			}
+			else
+			{
+				Assert(fdw_xact->inredo);
+				fdw_xact->inredo = false;
+			}
+
+			/* Mark the entry as ready */
+			fdw_xact->valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			pfree(fdw_xact_file_data);
+			LWLockRelease(FDWXactLock);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove foreign transaction state file \"%s\": %m",
+							path)));
+}
+
+/*
+ * FDWXactRedoAdd
+ *
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdw_xact entry in shared memory FDWXactData structure.
+ */
+void
+FDWXactRedoAdd(XLogReaderState *record)
+{
+	FDWXactOnDiskData *fdw_xact_data = (FDWXactOnDiskData *) XLogRecGetData(record);
+	FDWXact fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
+							   fdw_xact_data->serverid, fdw_xact_data->userid,
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+	fdw_xact->inredo = true;
+	LWLockRelease(FDWXactLock);
+}
+/*
+ * FDWXactRedoRemove
+ *
+ * Remove the corresponding fdw_xact entry from FDWXactGlobal.
+ * Also remove fdw_xact file if a foreign transaction was saved
+ * via an earlier checkpoint.
+ */
+void
+FDWXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	FDWXact	fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+	if (fdw_xact)
+	{
+		/* Now we can clean up any files we already left */
+		Assert(fdw_xact->inredo);
+		remove_fdw_xact(fdw_xact);
+	}
+	else
+	{
+		/*
+		 * Entry could be on disk. Call with giveWarning = false
+		 * since it can be expected during replay.
+		 */
+		RemoveFDWXactFile(xid, serverid, userid, false);
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..c10a027 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index f09941d..274f798 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -58,6 +58,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1455,6 +1456,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 1a1d4e5..efcb58c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1981,6 +1982,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2139,6 +2143,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
 	AtCommit_ApplyLauncher();
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2228,6 +2233,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2616,6 +2624,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4309,6 +4318,10 @@ AbortOutOfAnyTransaction(void)
 void
 RegisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is no need to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
 	XactWriteLocalNode = true;
 }
 
@@ -4318,6 +4331,10 @@ RegisterTransactionLocalNode(void)
 void
 UnregisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is no need to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
 	XactWriteLocalNode = false;
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9480377..7fec580 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5100,6 +5101,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6172,6 +6174,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 	}
 }
 
@@ -6865,7 +6870,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7490,6 +7498,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7676,6 +7685,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXacts();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8983,6 +8995,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9420,7 +9437,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9437,6 +9455,7 @@ XLogReportParameters(void)
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
+			xlrec.max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
@@ -9452,6 +9471,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9644,6 +9664,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9833,6 +9854,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 6511c60..15cad78 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b6552da..1ff8e6b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 68100df..3c05676 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1093,6 +1094,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1403,6 +1418,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index cd8b08f..148d19d 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
+FDWXactLock					45
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 4feb26a..ba1e8ca 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2065,6 +2066,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a02b154..27c5342 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e0c72fb..b695045 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -204,6 +204,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..f703e60 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -276,6 +276,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 27bd9b0..e64498f 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -585,6 +585,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -797,6 +798,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..41eed51 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -8,6 +8,7 @@
 #define FRONTEND 1
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..0b470b4
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
+#define FDWXactId(path, prefix, xid, serverid, userid)	\
+	snprintf((path), FDW_XACT_ID_LEN + 1, "%s_%08X_%08X_%08X", (prefix), \
+			 (xid), (serverid), (userid))
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;
+	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+}	FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId xid;
+	Oid			serverid;
+	Oid			userid;
+	Oid			dbid;
+}	FdwRemoveXlogRec;
+
+extern int	max_prepared_foreign_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+				Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void FDWXactRedoAdd(XLogReaderState *record);
+extern void FDWXactRedoRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif   /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 2f43c19..62702de 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 586f340..ddb6b5f 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -74,6 +74,9 @@ extern int	synchronous_commit;
 /* Kluge for 2PC support */
 extern bool MyXactAccessedTempRel;
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 7957cab..e1afff3 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..c57a66f 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 836d6ff..2f815a5 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5308,6 +5308,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4130 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4131 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4132 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..7b95f77 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,18 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															char *prep_info);
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -220,6 +233,11 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5f38fa6..e5f9d73 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -256,11 +256,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 
 /* configurable options */
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1435a7b..843c629 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -121,4 +121,8 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index bd13ae6..697ff81 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index b685aeb..478260b 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2263,9 +2263,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_fdw_transctions to enable testing of atomic
+		 * foreign transactions. (Note: to reduce the probability of unexpected
+		 * shmmax failures, don't set max_prepared_transactions or
+		 * max_prepared_foreign_transactions any higher than actually needed by the
+		 * corresponding regression tests.).
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2280,7 +2282,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{
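As a reading aid for the identifier layout in fdw_xact.h above: the FDWXactId macro formats a two-character prefix followed by three underscore-separated, zero-padded 8-digit hex fields (xid, serverid, userid), which is where FDW_XACT_ID_LEN = 2 + 1 + 8 + 1 + 8 + 1 + 8 = 29 comes from. Below is a minimal standalone sketch of that formatting. The helper name sketch_fdw_xact_id and the "px" prefix are assumptions made for illustration; the prefix the patch actually passes is not shown in this part of the diff.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same arithmetic as FDW_XACT_ID_LEN: prefix(2) + 3 * ('_' + 8 hex digits) = 29. */
#define SKETCH_FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

/*
 * Hypothetical stand-in for the FDWXactId macro (not part of the patch).
 * Formats "<prefix>_<xid>_<serverid>_<userid>" and returns the number of
 * characters written, excluding the terminating NUL.
 */
static int
sketch_fdw_xact_id(char *buf, size_t buflen, const char *prefix,
				   unsigned int xid, unsigned int serverid,
				   unsigned int userid)
{
	/* The length macro assumes a two-character prefix. */
	assert(strlen(prefix) == 2);
	/* The caller must leave room for the terminating NUL. */
	assert(buflen >= SKETCH_FDW_XACT_ID_LEN + 1);

	return snprintf(buf, buflen, "%s_%08X_%08X_%08X",
					prefix, xid, serverid, userid);
}
```

A caller would declare char id[SKETCH_FDW_XACT_ID_LEN + 1] and pass sizeof(id) as buflen. Note that the macro in the patch writes up to FDW_XACT_ID_LEN + 1 bytes including the terminator, while FDWXactOnDiskData.fdw_xact_id is declared with FDW_XACT_ID_LEN bytes, so it may be worth double-checking how the on-disk field is filled.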

Attachment: 000_register_local_write_v11.patch (application/octet-stream)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 02e0779..1a1d4e5 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -115,6 +115,8 @@ TransactionId *ParallelCurrentXids;
  */
 bool		MyXactAccessedTempRel = false;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 
 /*
  *	transaction states - transaction state from server perspective
@@ -2158,6 +2160,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2429,6 +2433,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2613,6 +2619,8 @@ AbortTransaction(void)
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4296,6 +4304,24 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29c6a6e..1bd4397 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -437,6 +437,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -698,6 +701,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -995,6 +1001,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index e7d1191..586f340 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -356,6 +356,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);

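To make the lifecycle of the XactWriteLocalNode flag easier to follow: the executor sets it from ExecInsert/ExecDelete/ExecUpdate, and CommitTransaction, PrepareTransaction and AbortTransaction clear it again. The sketch below models that with plain static variables, including the quick exit on max_prepared_foreign_xacts == 0 that the main patch adds on top of these functions. The helper sketch_count_participants is hypothetical and only illustrates one way the flag could be consumed (counting the local node as a participant); the patch's actual decision logic (FdwTwoPhaseNeeded) is not shown in this part of the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the patch's globals; values are local to this sketch. */
static int	max_prepared_foreign_xacts = 0;
static bool XactWriteLocalNode = false;

/* Called from the executor when a local tuple is inserted/updated/deleted. */
static void
RegisterTransactionLocalNode(void)
{
	/* Quick exit if the feature is disabled */
	if (max_prepared_foreign_xacts == 0)
		return;

	XactWriteLocalNode = true;
}

/* Called at commit, prepare and abort to reset the per-transaction flag. */
static void
UnregisterTransactionLocalNode(void)
{
	if (max_prepared_foreign_xacts == 0)
		return;

	XactWriteLocalNode = false;
}

/*
 * Hypothetical helper (an assumption, not in the patch): the number of
 * nodes that modified data, counting the local node if it was written.
 */
static int
sketch_count_participants(int n_foreign_servers)
{
	assert(n_foreign_servers >= 0);
	return n_foreign_servers + (XactWriteLocalNode ? 1 : 0);
}
```

With the feature disabled the flag is never set, so a transaction that writes locally and touches one foreign server would count one participant in this model; with the feature enabled it would count two.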
Attachment: 004_regression_test_for_fdw_xact_v11.patch (application/octet-stream)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 9d03d33..b3413ce 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -19,4 +19,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/010_fdw_xact.pl b/src/test/recovery/t/010_fdw_xact.pl
new file mode 100644
index 0000000..58bcefd
--- /dev/null
+++ b/src/test/recovery/t/010_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transactions involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check that we can commit and roll back transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check that we can commit and roll back transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit transactions involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact entries exists only in WAL.
+# Check that fdw_xact entries are processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after a foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 478260b..3e1ab6a 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2263,9 +2263,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts. We also set max_fdw_transctions to enable testing of atomic
-		 * foreign transactions. (Note: to reduce the probability of unexpected
-		 * shmmax failures, don't set max_prepared_transactions or
+		 * xacts. We also set max_prepared_foreign_transactions to enable testing
+		 * of atomic foreign transactions. (Note: to reduce the probability of
+		 * unexpected shmmax failures, don't set max_prepared_transactions or
 		 * max_prepared_foreign_transactions any higher than actually needed by the
 		 * corresponding regression tests.).
 		 */

005_get_prepare_id_api_v11.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe8500d..14ab99e 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -636,6 +636,23 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
  * here and UUID extension which provides the function to generate UUID is
  * not part of the core.
  */
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
+{
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The reported length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
 /*
  * postgresPrepareForeignTransaction
@@ -644,7 +661,7 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
  */
 bool
 postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
-								  char *prep_info)
+								  int prep_info_len, char *prep_info)
 {
 	StringInfo		command;
 	PGresult		*res;
@@ -664,7 +681,8 @@ postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
 		PGconn	*conn = entry->conn;
 
 		command = makeStringInfo();
-		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_info);
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
 		res = PQexec(conn, command->data);
 		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
 
@@ -738,7 +756,8 @@ postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit
  */
 bool
 postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
-										  bool is_commit, char *prep_info)
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
 {
 	PGconn			*conn = NULL;
 
@@ -777,9 +796,9 @@ postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
 		bool			result;
 
 		command = makeStringInfo();
-		appendStringInfo(command, "%s PREPARED '%s'",
-						 is_commit ? "COMMIT" : "ROLLBACK",
-						 prep_info);
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
 		res = PQexec(conn, command->data);
 
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ba6795a..88789a4 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -469,6 +469,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
 	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
 	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
 	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
 	routine->EndForeignTransaction = postgresEndForeignTransaction;
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f256a92..721858e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -179,11 +179,13 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
 extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
-											  Oid umid, char *prep_info);
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
 extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
 													  Oid umid, bool is_commit,
-													  char *prep_info);
+													  int prep_info_len, char *prep_info);
 extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
index 869faf7..fd4957c 100644
--- a/src/backend/access/rmgrdesc/fdw_xactdesc.c
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -38,7 +38,8 @@ fdw_xact_desc(StringInfo buf, XLogReaderState *record)
 		 * TODO: we also need to assess whether we want to add this
 		 * information
 		 */
-		appendStringInfo(buf, " foreign transaction info: %s",
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
 						 fdw_insert_xlog->fdw_xact_id);
 	}
 	else
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
index 90d11df..2b35f5f 100644
--- a/src/backend/access/transam/fdw_xact.c
+++ b/src/backend/access/transam/fdw_xact.c
@@ -109,6 +109,7 @@ typedef struct
 										 * protocol while committing
 										 * transaction on this server,
 										 * whenever necessary. */
+	GetPrepareId_function get_prepare_id;
 	EndForeignTransaction_function end_foreign_xact;
 	PrepareForeignTransaction_function prepare_foreign_xact;
 	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
@@ -176,6 +177,11 @@ RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
 					 errmsg("prepread foreign transactions are disabled"),
 					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
 
+		if (!fdw_routine->GetPrepareId)
+			ereport(ERROR,
+					(errmsg("no prepared transaction identifier providing function for FDW %s",
+							fdw->fdwname)));
+
 		if (!fdw_routine->PrepareForeignTransaction)
 			ereport(ERROR,
 					(errmsg("no function provided for preparing foreign transaction for FDW %s",
@@ -196,6 +202,7 @@ RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
 	 * system caches are not available. So save it before hand.
 	 */
 	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
 	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
 	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
 	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
@@ -208,6 +215,9 @@ RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
 	return;
 }
 
+/* A prepared transaction identifier can be at most 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN 256
+
 /* Enum to track the status of prepared foreign transaction */
 typedef enum
 {
@@ -250,8 +260,8 @@ typedef struct FDWXactData
 	BackendId	locking_backend;	/* Backend working on this entry */
 	bool		ondisk;			/* TRUE if prepare state file is on disk */
 	bool		inredo;			/* TRUE if entry was added via xlog_redo */
-	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction
-														 * identifier */
+	int			fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char		fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction id */
 }	FDWXactData;
 
 /* Directory where the foreign prepared transaction files will reside */
@@ -264,7 +274,7 @@ typedef struct FDWXactData
 #define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
 #define FDWXactFilePath(path, xid, serverid, userid)	\
 	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
-			 serverid, userid)
+							serverid, userid)
 
 /* Shared memory layout for maintaining foreign prepared transaction entries. */
 typedef struct
@@ -283,12 +293,12 @@ static void AtProcExit_FDWXact(int code, Datum arg);
 static bool resolve_fdw_xact(FDWXact fdw_xact,
   ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
 static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
-							   Oid umid, char *fdw_xact_id);
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id);
 static void unlock_fdw_xact(FDWXact fdw_xact);
 static void unlock_fdw_xact_entries();
 static void remove_fdw_xact(FDWXact fdw_xact);
 static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
-				  Oid umid, char *fdw_xact_info);
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
 static int	GetFDWXactList(FDWXact * fdw_xacts);
 static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
 static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
@@ -486,15 +496,17 @@ prepare_foreign_transactions(void)
 	foreach(lcell, MyFDWConnections)
 	{
 		FDWConnection *fdw_conn = (FDWConnection *) lfirst(lcell);
-		char	    fdw_xact_id[FDW_XACT_ID_LEN];
+		char	   *fdw_xact_id;
+		int			fdw_xact_id_len;
 		FDWXact		fdw_xact;
 
 		if (!fdw_conn->two_phase_commit)
 			continue;
 
-		/* Generate prepare transaction id for foreign server */
-		FDWXactId(fdw_xact_id, "px", GetTopTransactionId(),
-				  fdw_conn->serverid, fdw_conn->userid);
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+											   fdw_conn->userid,
+											   &fdw_xact_id_len);
 
 		/*
 		 * Register the foreign transaction with the identifier used to
@@ -518,7 +530,8 @@ prepare_foreign_transactions(void)
 		 */
 		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
 									 fdw_conn->serverid, fdw_conn->userid,
-									 fdw_conn->umid, fdw_xact_id);
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
 
 		/*
 		 * Between register_fdw_xact call above till this backend hears back
@@ -531,7 +544,8 @@ prepare_foreign_transactions(void)
 		 * resolved by the foreign transaction resolver.
 		 */
 		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
-											fdw_conn->umid, fdw_xact_id))
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
 		{
 			/*
 			 * An error occurred, and we didn't prepare the transaction.
@@ -573,7 +587,7 @@ prepare_foreign_transactions(void)
  */
 static FDWXact
 register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
-				  Oid umid, char *fdw_xact_id)
+				  Oid umid, int fdw_xact_id_len, char *fdw_xact_id)
 {
 	FDWXact		fdw_xact;
 	FDWXactOnDiskData *fdw_xact_file_data;
@@ -582,7 +596,7 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
 	/* Enter the foreign transaction in the shared memory structure */
 	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
 	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
-							   fdw_xact_id);
+							   fdw_xact_id_len, fdw_xact_id);
 	fdw_xact->status = FDW_XACT_PREPARING;
 	fdw_xact->locking_backend = MyBackendId;
 	LWLockRelease(FDWXactLock);
@@ -595,7 +609,7 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
 	 * of the xlog record are same as what is written to the file.
 	 */
 	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
-	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
 	data_len = MAXALIGN(data_len);
 	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
 	fdw_xact_file_data->dboid = fdw_xact->dboid;
@@ -603,8 +617,9 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
 	fdw_xact_file_data->serverid = fdw_xact->serverid;
 	fdw_xact_file_data->userid = fdw_xact->userid;
 	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
 	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
-		   FDW_XACT_ID_LEN);
+		   fdw_xact->fdw_xact_id_len);
 
 	START_CRIT_SECTION();
 
@@ -637,7 +652,7 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
  */
 static FDWXact
 insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
-				char *fdw_xact_id)
+				int fdw_xact_id_len, char *fdw_xact_id)
 {
 	int i;
 	FDWXact fdw_xact;
@@ -648,6 +663,10 @@ insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid
 		fdwXactExitRegistered = true;
 	}
 
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+			 fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
 	/* Check for duplicating foreign transaction entry */
 	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
 	{
@@ -691,7 +710,8 @@ insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid
 	fdw_xact->valid = false;
 	fdw_xact->ondisk = false;
 	fdw_xact->inredo = false;
-	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
 
 	return fdw_xact;
 }
@@ -1030,6 +1050,7 @@ resolve_fdw_xact(FDWXact fdw_xact,
 
 	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
 								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
 								fdw_xact->fdw_xact_id);
 
 	/* If we succeeded in resolving the transaction, remove the entry */
@@ -1560,7 +1581,7 @@ pg_fdw_xacts(PG_FUNCTION_ARGS)
 		values[4] = CStringGetTextDatum(xact_status);
 		/* should this be really interpreted by FDW */
 		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
-												 FDW_XACT_ID_LEN));
+												 fdw_xact->fdw_xact_id_len));
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
@@ -1786,7 +1807,7 @@ pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
 		values[4] = CStringGetTextDatum(xact_status);
 		/* should this be really interpreted by FDW? */
 		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
-															 FDW_XACT_ID_LEN));
+												 fdw_xact->fdw_xact_id_len));
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
@@ -2083,6 +2104,7 @@ RecoverFDWXacts(void)
 				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
 										   serverid, userid,
 										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id_len,
 										   fdw_xact_file_data->fdw_xact_id);
 				fdw_xact->locking_backend = MyBackendId;
 				fdw_xact->status = FDW_XACT_PREPARING;
@@ -2142,7 +2164,8 @@ FDWXactRedoAdd(XLogReaderState *record)
 	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
 	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
 							   fdw_xact_data->serverid, fdw_xact_data->userid,
-							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id_len,
+							   fdw_xact_data->fdw_xact_id);
 	fdw_xact->status = FDW_XACT_PREPARING;
 	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
 	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
index 0b470b4..69b74af 100644
--- a/src/include/access/fdw_xact.h
+++ b/src/include/access/fdw_xact.h
@@ -16,11 +16,6 @@
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
 
-#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
-#define FDWXactId(path, prefix, xid, serverid, userid)	\
-	snprintf((path), FDW_XACT_ID_LEN + 1, "%s_%08X_%08X_%08X", (prefix), \
-			 (xid), (serverid), (userid))
-
 /*
  * On disk file structure
  */
@@ -33,7 +28,13 @@ typedef struct
 								 * place */
 	Oid			userid;			/* user who initiated the foreign transaction */
 	Oid			umid;
-	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+	uint32		fdw_xact_id_len;/* Length of the value stored in the next
+								 * field */
+	/* This should always be the last member */
+	char		fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];		/* variable length array
+														 * to store foreign
+														 * transaction
+														 * information. */
 }	FDWXactOnDiskData;
 
 typedef struct
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 7b95f77..fdb7b19 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -148,14 +148,20 @@ typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
 												Oid umid, bool is_commit);
 
 typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
-													Oid umid, char *prep_info);
+													Oid umid, int prep_info_len,
+													char *prep_info);
 
 typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
 															Oid userid,
 															Oid umid,
 															bool is_commit,
+															int prep_info_len,
 															char *prep_info);
 
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -234,6 +240,7 @@ typedef struct FdwRoutine
 	ImportForeignSchema_function ImportForeignSchema;
 
 	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
 	EndForeignTransaction_function EndForeignTransaction;
 	PrepareForeignTransaction_function PrepareForeignTransaction;
 	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;

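The core idea of the 005 patch above is to carry the prepared transaction identifier as a (length, pointer) pair instead of a fixed-size buffer: the on-disk record is sized with `offsetof()` plus the actual length, and commands are built with `%.*s` so the bytes need no terminator. The following standalone sketch illustrates that pattern outside PostgreSQL. It is not code from the patch: `DemoXactRecord`, `DEMO_MAX_ID_LEN`, and the `demo_*` functions are hypothetical names, `calloc` stands in for `palloc0`, and the identifier format is borrowed from the `FDWXactId` macro that the patch removes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for FDWXactOnDiskData: fixed header plus variable-length id. */
typedef struct
{
	unsigned int serverid;
	unsigned int userid;
	unsigned int id_len;		/* bytes stored in id[], no terminator */
	char		id[];			/* flexible array member, must be last */
} DemoXactRecord;

/* Upper bound on the identifier length, mirroring MAX_FDW_XACT_ID_LEN. */
#define DEMO_MAX_ID_LEN 256

/*
 * Format an identifier into buf. Returns its length without the
 * terminating NUL, or -1 if it does not fit in buflen bytes.
 */
static int
demo_make_id(char *buf, size_t buflen, unsigned int xid,
			 unsigned int serverid, unsigned int userid)
{
	int			n = snprintf(buf, buflen, "px_%08X_%08X_%08X",
							 xid, serverid, userid);

	if (n < 0 || (size_t) n >= buflen)
		return -1;
	return n;
}

/* Allocate a record sized for exactly id_len identifier bytes. */
static DemoXactRecord *
demo_pack(unsigned int serverid, unsigned int userid,
		  const char *id, int id_len)
{
	DemoXactRecord *rec;

	/* Reject lengths outside the allowed range before copying anything. */
	if (id_len < 0 || id_len > DEMO_MAX_ID_LEN)
		return NULL;

	rec = calloc(1, offsetof(DemoXactRecord, id) + (size_t) id_len);
	if (rec == NULL)
		return NULL;

	rec->serverid = serverid;
	rec->userid = userid;
	rec->id_len = (unsigned int) id_len;
	memcpy(rec->id, id, (size_t) id_len);
	return rec;
}

/* Build a command; "%.*s" reads at most id_len bytes, so no NUL is required. */
static int
demo_format_command(char *out, size_t outlen, const DemoXactRecord *rec)
{
	assert(rec != NULL);
	return snprintf(out, outlen, "PREPARE TRANSACTION '%.*s'",
					(int) rec->id_len, rec->id);
}
```

A caller would format an identifier with `demo_make_id()`, hand the buffer and returned length to `demo_pack()`, and pass the record to `demo_format_command()`. Checking the length before the `memcpy` is what keeps a misbehaving identifier provider from overrunning the fixed-size shared-memory slot.
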
003_fdw_transaction_resolver_contrib_v11.patch (application/octet-stream)

diff --git a/contrib/fdw_transaction_resovler/Makefile b/contrib/fdw_transaction_resovler/Makefile
new file mode 100644
index 0000000..0d2e0e9
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/Makefile
@@ -0,0 +1,15 @@
+# contrib/fdw_transaction_resolver/Makefile
+
+MODULES = fdw_transaction_resolver
+PGFILEDESC = "fdw_transaction_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/fdw_transaction_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/fdw_transaction_resovler/TAGS b/contrib/fdw_transaction_resovler/TAGS
new file mode 120000
index 0000000..cf64c85
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/TAGS
@@ -0,0 +1 @@
+/home/masahiko/pgsql/source/postgresql/TAGS
\ No newline at end of file
diff --git a/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
new file mode 100644
index 0000000..f671de8
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
@@ -0,0 +1,455 @@
+/* -------------------------------------------------------------------------
+ *
+ * fdw_transaction_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/fdw_transaction_resolver/fdw_transaction_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int	fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to
+ *		wake it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("fdw_transaction_resolver.naptime",
+					  "Time to sleep between fdw_transaction_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag
+	 * even if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;	/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	   *dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("fdw_transaction_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);	/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM); /* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT); /* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int			rc;
+		int			naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time */
+		naptime_msec = (fx_resolver_naptime * 1000L) - ((current_time - launched_time) / 1000);
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will
+		 * restart them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid			dbid;
+
+			/* Get the database list if empty */
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has a database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+												fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	   *command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid			dbid = DatumGetObjectId(dbid_datum);
+	int			ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler
+	 * only for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function. Note that
+	 * each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time for
+	 * the statement we're about to run, and also the transaction start time.
+	 * Also, each other query sent to SPI should probably be preceded by
+	 * SetCurrentStatementStartTimestamp(), so that statement start time is
+	 * always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager, and
+	 * the PushActiveSnapshot() call creates an "active" snapshot which is
+	 * necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible through
+	 * the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List	   *dblist = NIL;
+	ListCell   *cell;
+	ListCell   *next;
+	ListCell   *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple	tup;
+	Relation	rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Check whether each database has a foreign transaction entry. Delete the
+	 * database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid			dbid = lfirst_oid(cell);
+		bool		exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index eaaa36c..63a33fd 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -116,6 +116,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
+ &fdw-transaction-resolver;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 6782f07..6d28cbd 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -121,6 +121,7 @@
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">
 <!ENTITY dummy-seclabel  SYSTEM "dummy-seclabel.sgml">
 <!ENTITY earthdistance   SYSTEM "earthdistance.sgml">
+<!ENTITY fdw-transaction-resolver SYSTEM "fdw-transaction-resolver.sgml">
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">
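
The get_database_list() function in the patch above first collects every database OID and then drops the ones that have no foreign transaction entry. A minimal standalone sketch of that pruning step follows. It is not the patch's code: it uses a plain array instead of a List, and SketchOid, sketch_prune_databases and the demo predicate sketch_even_oid_has_xact are hypothetical names introduced only for illustration (the real check is fdw_xact_exists()).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the backend's Oid type. */
typedef unsigned int SketchOid;

/* Callback answering "does this database have a foreign transaction entry?" */
typedef bool (*sketch_has_fdw_xact_fn) (SketchOid dbid);

/*
 * Demo predicate, an assumption for illustration only: pretend that
 * databases with an even OID have unresolved foreign transactions.
 */
bool
sketch_even_oid_has_xact(SketchOid dbid)
{
	return (dbid % 2) == 0;
}

/*
 * Compact dbids[] in place so that only databases with a foreign
 * transaction entry remain. Returns the number of entries kept.
 */
size_t
sketch_prune_databases(SketchOid *dbids, size_t n,
					   sketch_has_fdw_xact_fn has_xact)
{
	size_t		kept = 0;

	for (size_t i = 0; i < n; i++)
	{
		if (has_xact(dbids[i]))
			dbids[kept++] = dbids[i];
	}
	return kept;
}
```

With the demo predicate, pruning the array {1, 12, 13, 16384} keeps 12 and 16384, so a resolver worker would be launched only for those two databases.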

#129

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#128)

6 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Wed, Mar 22, 2017 at 2:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 16, 2017 at 2:37 PM, Vinayak Pokale
<pokale_vinayak_q3@lab.ntt.co.jp> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

I have tested the latest patch and it looks good to me,
so I marked it "Ready for committer".
Anyway, it would be great if anyone could also have a look at the patches and send comments.

The new status of this patch is: Ready for Committer

Thank you for updating, but I found a bug in the 001 patch. Attached are the latest patches.
The differences are
* Fixed a bug.
* Ran pgindent.
* Separated the patch supporting GetPrepareID API.

Since the previous patches conflict with current HEAD, I have attached the
latest set of patches.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

000_register_local_write_v12.patch (application/octet-stream)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c8751c6..5ca7375 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -115,6 +115,10 @@ TransactionId *ParallelCurrentXids;
  * globally accessible, so can be set from anywhere in the code that requires
  * recording flags.
  */
+bool		MyXactAccessedTempRel = false;
+
+/* True if the transaction has written on the local node */
+bool		XactWriteLocalNode = false;
 int  MyXactFlags;
 
 /*
@@ -2160,6 +2164,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2431,6 +2437,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2615,6 +2623,8 @@ AbortTransaction(void)
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4298,6 +4308,24 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0b524e0..661a82b 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -437,6 +437,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, oldslot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -698,6 +701,9 @@ ExecDelete(ItemPointer tupleid,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -995,6 +1001,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 5b37c05..aee1a07 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -376,6 +376,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
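
The lifecycle of the local-write flag in the patch above can be sketched outside the backend: the flag is set on the insert, update and delete paths and cleared at commit, prepare and abort. The following is a minimal standalone sketch with hypothetical sketch_* names, not the patch's code. How the flag is consumed is not shown in this patch, so the rule in sketch_needs_two_phase() is purely an assumption for illustration.

```c
#include <stdbool.h>

/* Set when the current transaction has modified a local table. */
static bool sketch_wrote_local = false;

/* Counterpart of RegisterTransactionLocalNode() in the patch. */
void
sketch_register_local_write(void)
{
	sketch_wrote_local = true;
}

/* Counterpart of UnregisterTransactionLocalNode() in the patch. */
void
sketch_unregister_local_write(void)
{
	sketch_wrote_local = false;
}

/* Stand-in for the executor's insert, update and delete paths. */
void
sketch_exec_modify(void)
{
	sketch_register_local_write();
}

/* Stand-in for the end of a transaction (commit, prepare or abort). */
void
sketch_end_transaction(void)
{
	sketch_unregister_local_write();
}

/*
 * Assumed rule, not taken from the patch: atomic commit only matters once
 * at least two nodes (local or foreign) have been written.
 */
bool
sketch_needs_two_phase(int foreign_servers_written)
{
	return foreign_servers_written + (sketch_wrote_local ? 1 : 0) >= 2;
}
```

Under that assumed rule, a transaction that writes to one foreign server needs two-phase commit only if it also wrote locally, and the answer reverts once the transaction ends and the flag is cleared.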

005_get_prepare_id_api_v12.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe8500d..14ab99e 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -636,6 +636,23 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
  * here and UUID extension which provides the function to generate UUID is
  * not part of the core.
  */
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
+{
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4d_%d_%d",
+								"px", abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* The reported length excludes the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
 
 /*
  * postgresPrepareForeignTransaction
@@ -644,7 +661,7 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
  */
 bool
 postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
-								  char *prep_info)
+								  int prep_info_len, char *prep_info)
 {
 	StringInfo		command;
 	PGresult		*res;
@@ -664,7 +681,8 @@ postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
 		PGconn	*conn = entry->conn;
 
 		command = makeStringInfo();
-		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_info);
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
 		res = PQexec(conn, command->data);
 		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
 
@@ -738,7 +756,8 @@ postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit
  */
 bool
 postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
-										  bool is_commit, char *prep_info)
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
 {
 	PGconn			*conn = NULL;
 
@@ -777,9 +796,9 @@ postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
 		bool			result;
 
 		command = makeStringInfo();
-		appendStringInfo(command, "%s PREPARED '%s'",
-						 is_commit ? "COMMIT" : "ROLLBACK",
-						 prep_info);
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
 		res = PQexec(conn, command->data);
 
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5006f2a..3533579 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -469,6 +469,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
 	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
 	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
 	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
 	routine->EndForeignTransaction = postgresEndForeignTransaction;
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f256a92..721858e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -179,11 +179,13 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
 extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
-											  Oid umid, char *prep_info);
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
 extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
 													  Oid umid, bool is_commit,
-													  char *prep_info);
+													  int prep_info_len, char *prep_info);
 extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
index 869faf7..fd4957c 100644
--- a/src/backend/access/rmgrdesc/fdw_xactdesc.c
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -38,7 +38,8 @@ fdw_xact_desc(StringInfo buf, XLogReaderState *record)
 		 * TODO: we also need to assess whether we want to add this
 		 * information
 		 */
-		appendStringInfo(buf, " foreign transaction info: %s",
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
 						 fdw_insert_xlog->fdw_xact_id);
 	}
 	else
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
index 90d11df..2b35f5f 100644
--- a/src/backend/access/transam/fdw_xact.c
+++ b/src/backend/access/transam/fdw_xact.c
@@ -109,6 +109,7 @@ typedef struct
 										 * protocol while committing
 										 * transaction on this server,
 										 * whenever necessary. */
+	GetPrepareId_function get_prepare_id;
 	EndForeignTransaction_function end_foreign_xact;
 	PrepareForeignTransaction_function prepare_foreign_xact;
 	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
@@ -176,6 +177,11 @@ RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
 					 errmsg("prepread foreign transactions are disabled"),
 					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
 
+		if (!fdw_routine->GetPrepareId)
+			ereport(ERROR,
+					(errmsg("no function provided for getting prepared transaction identifier for FDW %s",
+							fdw->fdwname)));
+
 		if (!fdw_routine->PrepareForeignTransaction)
 			ereport(ERROR,
 					(errmsg("no function provided for preparing foreign transaction for FDW %s",
@@ -196,6 +202,7 @@ RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
 	 * system caches are not available. So save it before hand.
 	 */
 	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
 	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
 	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
 	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
@@ -208,6 +215,9 @@ RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
 	return;
 }
 
+/* A prepared transaction identifier can be at most 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN 256
+
 /* Enum to track the status of prepared foreign transaction */
 typedef enum
 {
@@ -250,8 +260,8 @@ typedef struct FDWXactData
 	BackendId	locking_backend;	/* Backend working on this entry */
 	bool		ondisk;			/* TRUE if prepare state file is on disk */
 	bool		inredo;			/* TRUE if entry was added via xlog_redo */
-	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction
-														 * identifier */
+	int			fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char		fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction id */
 }	FDWXactData;
 
 /* Directory where the foreign prepared transaction files will reside */
@@ -264,7 +274,7 @@ typedef struct FDWXactData
 #define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
 #define FDWXactFilePath(path, xid, serverid, userid)	\
 	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
-			 serverid, userid)
+							serverid, userid)
 
 /* Shared memory layout for maintaining foreign prepared transaction entries. */
 typedef struct
@@ -283,12 +293,12 @@ static void AtProcExit_FDWXact(int code, Datum arg);
 static bool resolve_fdw_xact(FDWXact fdw_xact,
   ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
 static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
-							   Oid umid, char *fdw_xact_id);
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id);
 static void unlock_fdw_xact(FDWXact fdw_xact);
 static void unlock_fdw_xact_entries();
 static void remove_fdw_xact(FDWXact fdw_xact);
 static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
-				  Oid umid, char *fdw_xact_info);
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
 static int	GetFDWXactList(FDWXact * fdw_xacts);
 static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
 static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
@@ -486,15 +496,17 @@ prepare_foreign_transactions(void)
 	foreach(lcell, MyFDWConnections)
 	{
 		FDWConnection *fdw_conn = (FDWConnection *) lfirst(lcell);
-		char	    fdw_xact_id[FDW_XACT_ID_LEN];
+		char	   *fdw_xact_id;
+		int			fdw_xact_id_len;
 		FDWXact		fdw_xact;
 
 		if (!fdw_conn->two_phase_commit)
 			continue;
 
-		/* Generate prepare transaction id for foreign server */
-		FDWXactId(fdw_xact_id, "px", GetTopTransactionId(),
-				  fdw_conn->serverid, fdw_conn->userid);
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+											   fdw_conn->userid,
+											   &fdw_xact_id_len);
 
 		/*
 		 * Register the foreign transaction with the identifier used to
@@ -518,7 +530,8 @@ prepare_foreign_transactions(void)
 		 */
 		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
 									 fdw_conn->serverid, fdw_conn->userid,
-									 fdw_conn->umid, fdw_xact_id);
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
 
 		/*
 		 * Between register_fdw_xact call above till this backend hears back
@@ -531,7 +544,8 @@ prepare_foreign_transactions(void)
 		 * resolved by the foreign transaction resolver.
 		 */
 		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
-											fdw_conn->umid, fdw_xact_id))
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
 		{
 			/*
 			 * An error occurred, and we didn't prepare the transaction.
@@ -573,7 +587,7 @@ prepare_foreign_transactions(void)
  */
 static FDWXact
 register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
-				  Oid umid, char *fdw_xact_id)
+				  Oid umid, int fdw_xact_id_len, char *fdw_xact_id)
 {
 	FDWXact		fdw_xact;
 	FDWXactOnDiskData *fdw_xact_file_data;
@@ -582,7 +596,7 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
 	/* Enter the foreign transaction in the shared memory structure */
 	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
 	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
-							   fdw_xact_id);
+							   fdw_xact_id_len, fdw_xact_id);
 	fdw_xact->status = FDW_XACT_PREPARING;
 	fdw_xact->locking_backend = MyBackendId;
 	LWLockRelease(FDWXactLock);
@@ -595,7 +609,7 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
 	 * of the xlog record are same as what is written to the file.
 	 */
 	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
-	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
 	data_len = MAXALIGN(data_len);
 	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
 	fdw_xact_file_data->dboid = fdw_xact->dboid;
@@ -603,8 +617,9 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
 	fdw_xact_file_data->serverid = fdw_xact->serverid;
 	fdw_xact_file_data->userid = fdw_xact->userid;
 	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
 	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
-		   FDW_XACT_ID_LEN);
+		   fdw_xact->fdw_xact_id_len);
 
 	START_CRIT_SECTION();
 
@@ -637,7 +652,7 @@ register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
  */
 static FDWXact
 insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
-				char *fdw_xact_id)
+				int fdw_xact_id_len, char *fdw_xact_id)
 {
 	int i;
 	FDWXact fdw_xact;
@@ -648,6 +663,10 @@ insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid
 		fdwXactExitRegistered = true;
 	}
 
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+			 fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
 	/* Check for duplicating foreign transaction entry */
 	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
 	{
@@ -691,7 +710,8 @@ insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid
 	fdw_xact->valid = false;
 	fdw_xact->ondisk = false;
 	fdw_xact->inredo = false;
-	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
 
 	return fdw_xact;
 }
@@ -1030,6 +1050,7 @@ resolve_fdw_xact(FDWXact fdw_xact,
 
 	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
 								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
 								fdw_xact->fdw_xact_id);
 
 	/* If we succeeded in resolving the transaction, remove the entry */
@@ -1560,7 +1581,7 @@ pg_fdw_xacts(PG_FUNCTION_ARGS)
 		values[4] = CStringGetTextDatum(xact_status);
 		/* should this be really interpreted by FDW */
 		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
-												 FDW_XACT_ID_LEN));
+												 fdw_xact->fdw_xact_id_len));
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
@@ -1786,7 +1807,7 @@ pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
 		values[4] = CStringGetTextDatum(xact_status);
 		/* should this be really interpreted by FDW? */
 		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
-															 FDW_XACT_ID_LEN));
+												 fdw_xact->fdw_xact_id_len));
 
 		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
 		result = HeapTupleGetDatum(tuple);
@@ -2083,6 +2104,7 @@ RecoverFDWXacts(void)
 				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
 										   serverid, userid,
 										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id_len,
 										   fdw_xact_file_data->fdw_xact_id);
 				fdw_xact->locking_backend = MyBackendId;
 				fdw_xact->status = FDW_XACT_PREPARING;
@@ -2142,7 +2164,8 @@ FDWXactRedoAdd(XLogReaderState *record)
 	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
 	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
 							   fdw_xact_data->serverid, fdw_xact_data->userid,
-							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id_len,
+							   fdw_xact_data->fdw_xact_id);
 	fdw_xact->status = FDW_XACT_PREPARING;
 	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
 	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
index 0b470b4..69b74af 100644
--- a/src/include/access/fdw_xact.h
+++ b/src/include/access/fdw_xact.h
@@ -16,11 +16,6 @@
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
 
-#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
-#define FDWXactId(path, prefix, xid, serverid, userid)	\
-	snprintf((path), FDW_XACT_ID_LEN + 1, "%s_%08X_%08X_%08X", (prefix), \
-			 (xid), (serverid), (userid))
-
 /*
  * On disk file structure
  */
@@ -33,7 +28,13 @@ typedef struct
 								 * place */
 	Oid			userid;			/* user who initiated the foreign transaction */
 	Oid			umid;
-	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+	uint32		fdw_xact_id_len;/* Length of the value stored in the next
+								 * field */
+	/* This should always be the last member */
+	char		fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];		/* variable length array
+														 * to store foreign
+														 * transaction
+														 * information. */
 }	FDWXactOnDiskData;
 
 typedef struct
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 7b95f77..fdb7b19 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -148,14 +148,20 @@ typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
 												Oid umid, bool is_commit);
 
 typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
-													Oid umid, char *prep_info);
+													Oid umid, int prep_info_len,
+													char *prep_info);
 
 typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
 															Oid userid,
 															Oid umid,
 															bool is_commit,
+															int prep_info_len,
 															char *prep_info);
 
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -234,6 +240,7 @@ typedef struct FdwRoutine
 	ImportForeignSchema_function ImportForeignSchema;
 
 	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
 	EndForeignTransaction_function EndForeignTransaction;
 	PrepareForeignTransaction_function PrepareForeignTransaction;
 	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
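
The move from a fixed-size identifier to an identifier passed together with its length can be illustrated with a small standalone program. This is a sketch under stated assumptions, not the patch's code: make_prepare_id, format_prepare_command, record_size and SketchOnDiskData are hypothetical names, and malloc and snprintf stand in for palloc and StringInfo. It shows an identifier returned along with its length, a command built with "%.*s" so that only the stated number of bytes is read, and a record size computed from offsetof plus the actual identifier length.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Upper bound on identifier length, mirroring MAX_FDW_XACT_ID_LEN above. */
#define SKETCH_MAX_ID_LEN 256

/* Hypothetical stand-in for the on-disk record: fixed header, then the id. */
typedef struct
{
	unsigned int serverid;
	unsigned int userid;
	unsigned int id_len;		/* number of bytes stored in id[] */
	char		id[];			/* flexible array member, must be last */
} SketchOnDiskData;

/*
 * Build an identifier of the form "px_<token>_<serverid>_<userid>".
 * Returns a malloc'ed buffer and stores its length, excluding the
 * terminating NUL, in *id_len. Returns NULL on failure.
 */
char *
make_prepare_id(unsigned int token, unsigned int serverid,
				unsigned int userid, int *id_len)
{
	char	   *id = malloc(SKETCH_MAX_ID_LEN);
	int			n;

	if (id == NULL)
		return NULL;

	n = snprintf(id, SKETCH_MAX_ID_LEN, "px_%u_%u_%u",
				 token, serverid, userid);
	if (n < 0 || n >= SKETCH_MAX_ID_LEN)
	{
		free(id);
		return NULL;
	}
	*id_len = n;
	return id;
}

/*
 * Format "PREPARE TRANSACTION '<id>'" using an explicit length, so that
 * at most id_len bytes of id are read.
 */
int
format_prepare_command(char *buf, size_t buflen, int id_len, const char *id)
{
	return snprintf(buf, buflen, "PREPARE TRANSACTION '%.*s'", id_len, id);
}

/* Size of a record holding id_len identifier bytes (no alignment padding). */
size_t
record_size(int id_len)
{
	return offsetof(SketchOnDiskData, id) + (size_t) id_len;
}
```

Passing the length explicitly is what lets the stored identifier omit a terminator; the patch additionally rounds the record size up with MAXALIGN, which this sketch leaves out.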

004_regression_test_for_fdw_xact_v12.patch (application/octet-stream)

diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 142a1b8..1b28f3c 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -21,4 +21,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/010_fdw_xact.pl b/src/test/recovery/t/010_fdw_xact.pl
new file mode 100644
index 0000000..58bcefd
--- /dev/null
+++ b/src/test/recovery/t/010_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transactions involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check if we can commit and roll back transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check if we can commit and roll back transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit transactions involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact exists only in WAL.
+# Check if the fdw_xact entry can be processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after the foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 478260b..3e1ab6a 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2263,9 +2263,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts. We also set max_fdw_transctions to enable testing of atomic
-		 * foreign transactions. (Note: to reduce the probability of unexpected
-		 * shmmax failures, don't set max_prepared_transactions or
+		 * xacts. We also set max_prepared_foreign_transactions to enable testing
+		 * of atomic foreign transactions. (Note: to reduce the probability of
+		 * unexpected shmmax failures, don't set max_prepared_transactions or
 		 * max_prepared_foreign_transactions any higher than actually needed by the
 		 * corresponding regression tests.).
 		 */

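The dangling-transaction case in the TAP script above (stop node_fs1, attempt COMMIT PREPARED, restart node_fs1, then call pg_fdw_xact_resolve()) can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: the names fx_server, fx_prepare_all and fx_resolve are hypothetical and do not appear in the patch, and a real resolver talks to remote servers rather than flipping flags in an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-server state; not taken from the patch. */
enum fx_state { FX_NONE, FX_PREPARED, FX_COMMITTED };

struct fx_server
{
	bool		up;			/* is the foreign server reachable? */
	enum fx_state state;	/* state of our transaction on that server */
};

/*
 * Phase 1: prepare on every server. Returns false as soon as one server
 * is unreachable (a real implementation would then roll back the rest).
 */
static bool
fx_prepare_all(struct fx_server *s, int n)
{
	assert(s != NULL && n >= 0);

	for (int i = 0; i < n; i++)
	{
		if (!s[i].up)
			return false;
		s[i].state = FX_PREPARED;
	}
	return true;
}

/*
 * Phase 2, also usable as a later resolver pass: commit every prepared
 * entry whose server is reachable and return how many were resolved.
 * Entries on unreachable servers stay prepared ("dangling").
 */
static int
fx_resolve(struct fx_server *s, int n)
{
	int			resolved = 0;

	assert(s != NULL && n >= 0);

	for (int i = 0; i < n; i++)
	{
		if (s[i].state == FX_PREPARED && s[i].up)
		{
			s[i].state = FX_COMMITTED;
			resolved++;
		}
	}
	return resolved;
}
```

In this model, preparing on two servers, marking the first one as down, and calling fx_resolve() commits only the second; after the first server is marked up again, a second fx_resolve() call returns 1, which corresponds to the count the test expects from pg_fdw_xact_resolve().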
Attachment: 003_fdw_transaction_resolver_v12.patch (application/octet-stream)

diff --git a/contrib/fdw_transaction_resovler/Makefile b/contrib/fdw_transaction_resovler/Makefile
new file mode 100644
index 0000000..0d2e0e9
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/Makefile
@@ -0,0 +1,15 @@
+# contrib/fdw_transaction_resolver/Makefile
+
+MODULES = fdw_transaction_resolver
+PGFILEDESC = "fdw_transaction_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/fdw_transaction_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/fdw_transaction_resovler/TAGS b/contrib/fdw_transaction_resovler/TAGS
new file mode 120000
index 0000000..cf64c85
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/TAGS
@@ -0,0 +1 @@
+/home/masahiko/pgsql/source/postgresql/TAGS
\ No newline at end of file
diff --git a/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
new file mode 100644
index 0000000..f671de8
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
@@ -0,0 +1,455 @@
+/* -------------------------------------------------------------------------
+ *
+ * fdw_transaction_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/fdw_transaction_resolver/fdw_transaction_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int	fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("fdw_transaction_resolver.naptime",
+					  "Time to sleep between fdw_transaction_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it is not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag
+	 * even if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	worker.bgw_main = FDWXactResolverMain;
+	worker.bgw_main_arg = (Datum) 0;	/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	   *dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("fdw_transaction_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);	/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM); /* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT); /* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int			rc;
+		int			naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time in msec; timestamp differences are in usec */
+		naptime_msec = (fx_resolver_naptime * 1000L) - ((current_time - launched_time) / 1000);
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will
+		 * restart them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid			dbid;
+
+			/* Get the database list if empty */
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				worker.bgw_main = FDWXactResolver_worker_main;
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+												fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	   *command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid			dbid = DatumGetObjectId(dbid_datum);
+	int			ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler
+	 * only for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function. Note that
+	 * each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time for
+	 * the statement we're about to run, and also the transaction start time.
+	 * Also, each other query sent to SPI should probably be preceded by
+	 * SetCurrentStatementStartTimestamp(), so that statement start time is
+	 * always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager, and
+	 * the PushActiveSnapshot() call creates an "active" snapshot which is
+	 * necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible through
+	 * the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List	   *dblist = NIL;
+	ListCell   *cell;
+	ListCell   *next;
+	ListCell   *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple	tup;
+	Relation	rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Check whether each database has a foreign transaction entry. Delete the
+	 * database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid			dbid = lfirst_oid(cell);
+		bool		exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index eaaa36c..63a33fd 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -116,6 +116,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
+ &fdw-transaction-resolver;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index 6782f07..6d28cbd 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -121,6 +121,7 @@
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">
 <!ENTITY dummy-seclabel  SYSTEM "dummy-seclabel.sgml">
 <!ENTITY earthdistance   SYSTEM "earthdistance.sgml">
+<!ENTITY fdw-transaction-resolver SYSTEM "fdw-transaction-resolver.sgml">
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">

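The launcher loop in the resolver patch above works with three time units: the fdw_transaction_resolver.naptime GUC is defined with a default of 60 and treated as seconds, PostgreSQL's TimestampTz values count microseconds, and the WaitLatch() timeout is given in milliseconds. A minimal sketch of the unit handling follows; ts_usec and remaining_nap_ms are hypothetical names used for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for TimestampTz: microseconds since some epoch. */
typedef int64_t ts_usec;

/*
 * Return how many milliseconds remain before the next launch is due,
 * given the current time, the time of the last launch (both in
 * microseconds) and the nap time in seconds. Never returns a negative
 * value, so the result can be used directly as a wait timeout.
 */
static long
remaining_nap_ms(ts_usec now, ts_usec launched, int naptime_s)
{
	int64_t		elapsed_ms;
	int64_t		left_ms;

	assert(naptime_s > 0);

	elapsed_ms = (now - launched) / 1000;			/* usec -> msec */
	left_ms = (int64_t) naptime_s * 1000 - elapsed_ms;	/* sec -> msec */

	return left_ms < 0 ? 0 : (long) left_ms;
}
```

With a nap time of 60 seconds, a call made 30 seconds (30,000,000 microseconds) after the last launch yields 30000, and a call made 61 seconds after it yields 0, so the loop would proceed without sleeping.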
Attachment: 002_pgfdw_support_atomic_commit_v12.patch (application/octet-stream)

diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..fe8500d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,7 +14,9 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -64,16 +66,19 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
 					   void *arg);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -86,6 +91,9 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
  *
+ * connection_error_ok, if true, indicates that the caller can handle a
+ * connection error by itself. If false, an error is raised.
+ *
  * XXX Note that caching connections theoretically requires a mechanism to
  * detect change of FDW objects to invalidate already established connections.
  * We could manage that by watching for invalidation events on the relevant
@@ -94,7 +102,8 @@ static void pgfdw_subxact_callback(SubXactEvent event,
  * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,9 +131,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -159,7 +165,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;	/* just to be sure */
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -168,7 +187,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -178,9 +202,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -226,11 +253,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
 			ereport(ERROR,
 			   (errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
 				errmsg("could not connect to server \"%s\"",
 					   server->servername),
 				errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -360,15 +401,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and record whether it can take
+		 * part in two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -576,158 +624,265 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
+ * postgresGetPrepareId
+ *
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
  */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
+
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * The function prepares transaction on foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  char *prep_info)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
 
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * The function commits or aborts a prepared transaction on a foreign server.
+ * This function may be called when we have no connection to the foreign
+ * server involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit, char *prep_info)
+{
+	PGconn			*conn = NULL;
 
 	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
 	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
+	if (ConnectionHash)
 	{
-		PGresult   *res;
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
 
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
 
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%s'",
+						 is_commit ? "COMMIT" : "ROLLBACK",
+						 prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
 		{
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
 
-			switch (event)
+			if (diag_sqlstate)
 			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-					/* Commit all remote transactions during pre-commit */
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE)
-					{
-						PGcancel   *cancel;
-						char		errbuf[256];
-
-						if ((cancel = PQgetCancel(entry->conn)))
-						{
-							if (!PQcancel(cancel, errbuf, sizeof(errbuf)))
-								ereport(WARNING,
-										(errcode(ERRCODE_CONNECTION_FAILURE),
-								  errmsg("could not send cancel request: %s",
-										 errbuf)));
-							PQfreeCancel(cancel);
-						}
-					}
-
-					/* If we're aborting, abort all remote transactions too */
-					res = PQexec(entry->conn, "ABORT TRANSACTION");
-					/* Note: can't throw ERROR, it would be infinite loop */
-					if (PQresultStatus(res) != PGRES_COMMAND_OK)
-						pgfdw_report_error(WARNING, res, entry->conn, true,
-										   "ABORT TRANSACTION");
-					else
-					{
-						PQclear(res);
-						/* As above, make sure to clear any prepared stmts */
-						if (entry->have_prep_stmt && entry->have_error)
-						{
-							res = PQexec(entry->conn, "DEALLOCATE ALL");
-							PQclear(res);
-						}
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-					break;
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
 			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
 		}
+		else
+			result = true;
 
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
 
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			PQfinish(entry->conn);
-			entry->conn = NULL;
-		}
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
 	}
 
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
 	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to end of transaction
+	 * call back.
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
+ * pgfdw_xact_callback --- cleanup at main-transaction end.
+ */
+static void
+pgfdw_xact_callback(XactEvent event, void *arg)
+{
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -826,3 +981,26 @@ pgfdw_subxact_callback(SubXactEvent event, SubTransactionId mySubid,
 		entry->xact_depth--;
 	}
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a466bf2..5875d52 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Now we should be able to run ANALYZE.
 -- To exercise multiple code paths, we use local stats on ft1
@@ -7186,3 +7214,139 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+\des+
+                                                                                                                                                                                                                                                      List of foreign servers
+    Name     |  Owner   | Foreign-data wrapper | Access privileges | Type | Version |                                                                                                                                                                                                          FDW Options                                                                                                                                                                                                           | Description 
+-------------+----------+----------------------+-------------------+------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
+ loopback    | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', extensions 'postgres_fdw', two_phase_commit 'off')                                                                                                                                                                                                                                                                                                                                 | 
+ loopback2   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ loopback3   | masahiko | postgres_fdw         |                   |      |         | (dbname 'contrib_regression', port '50848', two_phase_commit 'on')                                                                                                                                                                                                                                                                                                                                                             | 
+ testserver1 | masahiko | postgres_fdw         |                   |      |         | (use_remote_estimate 'false', updatable 'true', fdw_startup_cost '123.456', fdw_tuple_cost '0.123', service 'value', connect_timeout 'value', dbname 'value', host 'value', hostaddr 'value', port 'value', application_name 'value', keepalives 'value', keepalives_idle 'value', keepalives_interval 'value', sslcompression 'value', sslmode 'value', sslcert 'value', sslkey 'value', sslrootcert 'value', sslcrl 'value') | 
+(4 rows)
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index e24db56..c048c0d 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 03f1480..5006f2a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -466,6 +468,11 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1327,7 +1334,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1704,7 +1711,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2299,7 +2306,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2561,7 +2568,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								&retrieved_attrs, NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3498,7 +3505,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3588,7 +3595,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3811,7 +3818,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 57dbb79..f256a92 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -116,7 +117,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -177,6 +179,12 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						RelOptInfo *foreignrel, List *tlist,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 8f3edc1..caf0aa2 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1706,3 +1736,77 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+\des+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 7a9b655..8f6ab2c 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -426,6 +426,42 @@
     foreign tables, see <xref linkend="sql-createforeigntable">.
    </para>
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers, the
+    transaction on each remote server is committed or aborted independently.
+    Some of these transactions may fail to commit on their remote server
+    while others commit successfully. This behavior can be changed using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> may
+       use two-phase commit when the transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to a
+       nonzero value on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to a nonzero value on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
  </sect2>
 
  <sect2>

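The option lookup done by server_uses_two_phase_commit() in the patch above can be illustrated outside the backend. The following is only a minimal sketch: ServerOption and parse_bool_option are hypothetical stand-ins for DefElem and defGetBoolean (which accepts more spellings than shown here), and the function name is illustrative rather than taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a DefElem: one name/value server option. */
typedef struct ServerOption
{
	const char *name;
	const char *value;
} ServerOption;

/*
 * Simplified boolean parser (assumption): treats only "on" and "true" as
 * true. The backend's defGetBoolean() accepts more spellings and raises an
 * error on invalid input.
 */
static bool
parse_bool_option(const char *value)
{
	return strcmp(value, "on") == 0 || strcmp(value, "true") == 0;
}

/*
 * Return the value of the "two_phase_commit" option if present.
 * A server without the option is treated as not 2PC compliant.
 */
static bool
uses_two_phase_commit(const ServerOption *options, size_t noptions)
{
	for (size_t i = 0; i < noptions; i++)
	{
		if (strcmp(options[i].name, "two_phase_commit") == 0)
			return parse_bool_option(options[i].value);
	}
	return false;
}
```

With the regression-test setup above, loopback (two_phase_commit 'off') and a server with no such option both yield false, while loopback2 and loopback3 yield true.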
Attachment: 001_support_fdw_xact_v12.patch (application/octet-stream)

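The fdwhandler.sgml text added by this patch describes when the two-phase commit protocol is used. That rule can be sketched as a small predicate. This is an illustration only, not code from the patch: XactSummary and needs_two_phase_commit are hypothetical names, the threshold is assumed to be two or more capable servers (the postgres_fdw regression tests exercise exactly two), and the handling of servers that do not support two-phase commit is left out because the documentation does not spell it out.

```c
#include <stdbool.h>

/* Hypothetical summary of one local transaction's participants. */
typedef struct XactSummary
{
	int		max_prepared_foreign_xacts; /* value of the GUC */
	int		capable_servers;	/* registered servers that support 2PC */
	bool	local_changes;		/* did the transaction modify local data? */
} XactSummary;

/*
 * Decide whether the two-phase commit protocol would be used, following
 * the rule in the documentation (assumed reading: two or more capable
 * servers, or one capable server plus local changes).
 */
static bool
needs_two_phase_commit(const XactSummary *x)
{
	if (x->max_prepared_foreign_xacts <= 0)
		return false;			/* feature disabled: no atomicity guarantee */
	if (x->capable_servers >= 2)
		return true;
	if (x->capable_servers == 1 && x->local_changes)
		return true;
	return false;				/* e.g. a single capable server, no local changes */
}
```

A transaction that touches one capable server and nothing else commits without PREPARE TRANSACTION under this reading, which avoids the extra round trip when there is nothing to coordinate.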
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index ac339fb..09f67a3 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1432,6 +1432,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index dbeaab5..639e38b 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1714,5 +1714,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts
+    a transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it is required to register the foreign server, along with the
+    <productname>PostgreSQL</> user whose user mapping is used to connect to that
+    server, using <function>RegisterXactForeignServer</>.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    the two-phase commit protocol, false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    Here ft1 and ft2 are foreign tables on different foreign servers, which may use
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is greater than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used when two or
+    more foreign servers that support two-phase commit are involved in the
+    transaction, or when the transaction involves one foreign server that
+    supports two-phase commit together with changes to local data. In other
+    cases, for example when only one foreign server that supports two-phase
+    commit is involved in the transaction and no local data is changed, the
+    two-phase commit protocol is not used. The two-phase commit protocol is
+    processed in two phases: the prepare phase and the commit phase. In the prepare
+    phase, <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed to prepare the
+    transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase, the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier
+    with action FDW_XACT_RESOLVED.
+    </para>
+
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier and
+    action FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension
+    fdw_transaction_resolver and including $libdir/fdw_transaction_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit
+    cannot be guaranteed across foreign servers. If the transaction on
+    <productname>PostgreSQL</> is committed, the distributed transaction manager
+    commits the transaction on all the foreign servers registered using
+    <function>RegisterXactForeignServer</>, independent of the outcome of the
+    same operation on other foreign servers. Thus transactions on some foreign
+    servers may be committed, while those on other foreign servers are rolled
+    back. If the transaction on <productname>PostgreSQL</> aborts, the
+    transactions on all the foreign servers are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
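A note on the documentation hunk above: the rule for when two-phase commit is
actually used can be sketched as a small standalone C function. This is only an
illustration of the rule as the documentation and PreCommit_FDWXacts() describe
it, not code from the patch. The enum, its values and the function name are
hypothetical, and the sketch assumes max_prepared_foreign_transactions is
greater than zero and that servers which cannot do two-phase commit have
already been committed and dropped from the list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names, introduced only for this illustration. */
typedef enum
{
	COMMIT_LOCAL_ONLY,	/* no 2PC-capable foreign server is registered */
	COMMIT_ONE_PHASE,	/* commit directly on the single foreign server */
	COMMIT_TWO_PHASE	/* prepare everywhere, then commit or roll back */
} CommitStrategy;

/*
 * n_two_phase_servers: number of registered foreign servers that support
 * two-phase commit.
 * wrote_local_data: whether the transaction changed local data.
 */
static CommitStrategy
choose_commit_strategy(int n_two_phase_servers, bool wrote_local_data)
{
	assert(n_two_phase_servers >= 0);

	if (n_two_phase_servers < 1)
		return COMMIT_LOCAL_ONLY;

	/* Two or more servers, or one server plus local changes: use 2PC. */
	if (n_two_phase_servers > 1 || wrote_local_data)
		return COMMIT_TWO_PHASE;

	/* Exactly one server and no local changes: plain commit suffices. */
	return COMMIT_ONE_PHASE;
}
```

For instance, choose_commit_strategy(1, true) selects two-phase commit, while
choose_commit_strategy(1, false) does not, since a single participant cannot
disagree with anyone.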
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..869faf7
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,68 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+
+		/*
+		 * TODO: we also need to assess whether we want to add this
+		 * information
+		 */
+		appendStringInfo(buf, " foreign transaction info: %s",
+						 fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 5f07eb1..ff3064e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..90d11df
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2182 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using function RegisterXactForeignServer(). A foreign server
+ * connection is identified by the oids of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * respective prepared transactions. If the first phase does not succeed
+ * because of any failure, the foreign servers are asked to roll back their
+ * prepared transactions, or abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to the
+ * disk and logs in order to resolve such transactions.
+ *
+ * During replay and replication FDWXactGlobal also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdw_xact records happens by the following rules:
+ *
+ *		* On PREPARE redo we add the foreign transaction to
+ *		  FDWXactGlobal->fdw_xacts. We set fdw_xact->inredo to true for
+ *		  such entries.
+ *
+ *		* On Checkpoint redo we iterate through the FDWXactGlobal->fdw_xacts
+ *		  entries that have fdw_xact->inredo set and are behind the redo_horizon.
+ *		  We save them to disk and also set fdw_xact->ondisk to true.
+ *
+ *		* On COMMIT/ABORT we delete the entry from FDWXactGlobal->fdw_xacts.
+ *		  If fdw_xact->ondisk is true, we delete the corresponding entry from
+ *		  the disk as well.
+ *
+ *		* RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions() and
+ *		  PrescanPreparedTransactions() have been modified to go through
+ *		  fdw_xact->inredo entries that have not made it to disk yet.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData *FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid			serverid;
+	Oid			userid;
+	Oid			umid;
+	char	   *servername;
+	FDWXact		fdw_xact;		/* foreign prepared transaction entry in case
+								 * prepared */
+	bool		two_phase_commit;		/* Should use two phase commit
+										 * protocol while committing
+										 * transaction on this server,
+										 * whenever necessary. */
+	EndForeignTransaction_function end_foreign_xact;
+	PrepareForeignTransaction_function prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
+}	FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	   *MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool		TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection *fdw_conn;
+	ListCell   *lcell;
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	UserMapping *user_mapping;
+	FdwRoutine *fdw_routine;
+	MemoryContext old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+			(errmsg("attempt to start transaction again on server %u user %u",
+					serverid, userid)));
+	}
+
+	/*
+	 * This list and its contents need to be saved in the transaction context
+	 * memory
+	 */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		ereport(ERROR,
+				(errmsg("no function to end a foreign transaction provided for FDW %s",
+						fdw->fdwname)));
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for preparing foreign transaction for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for resolving prepared foreign transaction for FDW %s",
+							fdw->fdwname)));
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,		/* foreign prepared transaction is to
+										 * be committed */
+	FDW_XACT_ABORTING_PREPARED, /* foreign prepared transaction is to be
+								 * aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								 * It doesn't appear in the in-memory entry. */
+}	FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact		fx_next;		/* Next free FDWXact entry */
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;	/* XID of local transaction */
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping id for connection key */
+	FDWXactStatus status;		/* The state of the foreign
+								 * transaction. This doubles as the
+								 * action to be taken on this entry. */
+
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior to
+	 * commit.
+	 */
+	XLogRecPtr	fdw_xact_start_lsn;		/* XLOG offset of inserting this entry
+										 * start */
+	XLogRecPtr	fdw_xact_end_lsn;		/* XLOG offset of inserting this entry
+										 * end */
+
+	bool		valid; /* Has the entry been completed and written to file? */
+	BackendId	locking_backend;	/* Backend working on this entry */
+	bool		ondisk;			/* TRUE if prepare state file is on disk */
+	bool		inredo;			/* TRUE if entry was added via xlog_redo */
+	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction
+														 * identifier */
+}	FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+			 serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			numFDWXacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];		/* Variable length array */
+}	FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+  ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, char *fdw_xact_id);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_info);
+static int	GetFDWXactList(FDWXact * fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+				Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+				  bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+static FDWXact get_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int			max_prepared_foreign_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData *FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	   *MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact		fdw_xacts;
+		int			cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->numFDWXacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Normally the foreign transactions are prepared on the foreign servers that
+ * can execute the two-phase-commit protocol; those will be committed or aborted
+ * after the current transaction has been committed or aborted, respectively.
+ * If only one such server is involved in the transaction and no changes were
+ * made to local data, two-phase commit is unnecessary, so we simply try to
+ * commit the transaction on that server. We also try to commit transactions
+ * on the rest of the foreign servers now. For these foreign servers it is
+ * possible that some transactions commit even if the local transaction
+ * aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell   *cur;
+	ListCell   *prev;
+	ListCell   *next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not
+	 * execute two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+					 fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Here the foreign servers that can not execute two-phase-commit protocol
+	 * have already committed, and MyFDWConnections has only the foreign
+	 * servers that can execute two-phase-commit protocol. We don't need to
+	 * use two-phase-commit protocol if there is only one such foreign server
+	 * and the transaction didn't write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * We don't need to use two-phase commit protocol when only one
+		 * server remains, even if this server can execute two-phase-commit
+		 * protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* MyFDWConnections should be cleared here */
+		MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell   *lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = (FDWConnection *) lfirst(lcell);
+		char	    fdw_xact_id[FDW_XACT_ID_LEN];
+		FDWXact		fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		/* Generate prepare transaction id for foreign server */
+		FDWXactId(fdw_xact_id, "px", GetTopTransactionId(),
+				  fdw_conn->serverid, fdw_conn->userid);
+
+		/*
+		 * Register the foreign transaction with the identifier used to
+		 * prepare it on the foreign server. Registration persists this
+		 * information to the disk and logs (that way relaying it on standby).
+		 * Thus in case we lose connectivity to the foreign server or crash
+		 * ourselves, we will remember that we have prepared transaction on
+		 * the foreign server and try to resolve it when connectivity is
+		 * restored or after crash recovery.
+		 *
+		 * If we crash after persisting the information but before preparing
+		 * the transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long
+		 * as the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before
+		 * persisting the information to the disk and crash in-between these
+		 * two steps, we will forget that we prepared the transaction on the
+		 * foreign server and will not be able to resolve it after the crash.
+		 * Hence persist first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id);
+
+		/*
+		 * Between register_fdw_xact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction
+		 * (say, because of a signal). During abort processing, it will send
+		 * an ABORT message to the foreign server. If the foreign server has
+		 * not prepared the transaction, the message will succeed. If the
+		 * foreign server has prepared transaction, it will throw an error,
+		 * which we will ignore and the prepared foreign transaction will be
+		 * resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction.
+			 * Delete the entry from foreign transaction table. Raise an
+			 * error, so that the local server knows that one of the foreign
+			 * server has failed to prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then
+			 * we raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell pointer,
+			 * but that is fine since we will not use that pointer because the
+			 * subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+				  (errmsg("can not prepare transaction on foreign server %s",
+						  fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; it is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_id)
+{
+	FDWXact		fdw_xact;
+	FDWXactOnDiskData *fdw_xact_file_data;
+	int			data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->locking_backend = MyBackendId;
+	LWLockRelease(FDWXactLock);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+		   FDW_XACT_ID_LEN);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *) fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* WAL record is flushed, checkpoint can proceed with syncing */
+	fdw_xact->valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. Caller must hold
+ * FDWXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				char *fdw_xact_id)
+{
+	int i;
+	FDWXact fdw_xact;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	/* Check for duplicating foreign transaction entry */
+	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[i];
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+				 xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+		errhint("Increase max_prepared_foreign_transactions (currently %d).",
+				max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->numFDWXacts < max_prepared_foreign_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->numFDWXacts++] = fdw_xact;
+
+	/* Initialise the entry; the caller stamps locking_backend afterwards */
+	fdw_xact->locking_backend = InvalidBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->inredo = false;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int			cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			/* Remove the entry from active array */
+			FDWXactGlobal->numFDWXacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->numFDWXacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			if (!RecoveryInProgress())
+			{
+				FdwRemoveXlogRec fdw_remove_xlog;
+				XLogRecPtr	recptr;
+
+				/* Fill up the log record before releasing the entry */
+				fdw_remove_xlog.serverid = fdw_xact->serverid;
+				fdw_remove_xlog.dbid = fdw_xact->dboid;
+				fdw_remove_xlog.xid = fdw_xact->local_xid;
+				fdw_remove_xlog.userid = fdw_xact->userid;
+
+				START_CRIT_SECTION();
+
+				/*
+				 * Log that we are removing the foreign transaction entry and
+				 * remove the file from the disk as well.
+				 */
+				XLogBeginInsert();
+				XLogRegisterData((char *) &fdw_remove_xlog, sizeof(fdw_remove_xlog));
+				recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+				XLogFlush(recptr);
+
+				END_CRIT_SECTION();
+			}
+
+			/* Remove the file from the disk if it exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								  fdw_xact->userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse
+	 * the order and the process exits in between those two, we will be left
+	 * with an entry locked by this backend, which gets unlocked only at the
+	 * server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact		fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact		fdw_xact = fdw_conn->fdw_xact;
+
+			fdw_xact->status = (is_commit ?
+										 FDW_XACT_COMMITTING_PREPARED :
+										 FDW_XACT_ABORTING_PREPARED);
+
+			/*
+			 * Try aborting or committing the transaction on the foreign
+			 * server
+			 */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server,
+				 * unlock it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be
+			 * executed we have tried to commit the transactions during
+			 * pre-commit phase. Any remaining transactions need to be
+			 * aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+					 is_commit ? "commit" : "abort",
+					 fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the
+	 * entries, and may not be able to unlock them if aborted in-between. In
+	 * any case, there is no reason for a foreign transaction entry to be
+	 * locked after the transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared
+	 * should be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	   *entries_to_resolve;
+
+	FDWXactStatus status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+	FDW_XACT_ABORTING_PREPARED;
+
+	/*
+	 * Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to check
+	 * whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact		fdw_xact = linitial(entries_to_resolve);
+
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+							  get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				 ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool		resolved;
+	bool		is_commit;
+
+	Assert(fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+		   fdw_xact->status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * Get foreign transaction entry from FDWXactGlobal->fdw_xacts. Return NULL
+ * if the given foreign transaction does not exist.
+ */
+static FDWXact
+get_fdw_xact(TransactionId xid, Oid serverid, Oid userid)
+{
+	int i;
+	FDWXact fdw_xact;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[i];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			LWLockRelease(FDWXactLock);
+			return fdw_xact;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	return NULL;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches criteria. This function is a wrapper around search_fdw_xact. Check
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria are defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid   | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid   | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid   | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid   | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid   | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid   | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), every entry matches,
+ * so the function returns true whenever at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked, nothing will
+ * be returned in the list but the returned value will be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		FDWXact		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool		entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list.
+				 * If the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+
+					fdw_xact->locking_backend = MyBackendId;
+
+					/*
+					 * The list and its members may be required at the end of
+					 * the transaction
+					 */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		FDWXactRedoAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		/* Delete FDWXact entry and file if exists */
+		FDWXactRedoRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * For simplicity, the function writes out the files while holding
+ * FDWXactLock in shared mode; see the comment inside the function for why
+ * that is considered acceptable and how the I/O could be moved out of the
+ * lock later.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int			cnt;
+	int			serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick get-away, if there are no entries to process */
+	if (FDWXactGlobal->numFDWXacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked
+	 * invalid, because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		FDWXact		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if ((fdw_xact->valid || fdw_xact->inredo) &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+			  (errmsg_plural("%u foreign transaction state file was written "
+							 "for long-running prepared transactions",
+							 "%u foreign transaction state files were written "
+							 "for long-running prepared transactions",
+							 serialized_fdw_xacts,
+							 serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+		   errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not read foreign transaction state from xlog at %X/%X",
+			   (uint32) (lsn >> 32),
+			   (uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not recreate foreign transaction state file \"%s\": %m",
+			   path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact		fdw_xacts;
+	int			num_xacts;
+	int			cur_xact;
+}	WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want
+ * them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact * fdw_xacts)
+{
+	int			num_xacts;
+	int			cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->numFDWXacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->numFDWXacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+	List	   *entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
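+		/*
+		 * Size the result array for the maximum possible number of entries;
+		 * numFDWXacts is read here without FDWXactLock and may grow before the
+		 * search below.
+		 */
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * max_prepared_foreign_xacts);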
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact		fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+				   fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+				fdw_xact->status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing more to decide here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The local transaction is neither committed nor aborted, yet
+				 * it is not running on any backend. So probably, it crashed
+				 * before actual abort or commit. So assume it to be aborted.
+				 */
+				fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the
+				 * foreign transaction. This can happen when the foreign
+				 * transaction is prepared as part of a local prepared
+				 * transaction. Just continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not
+			 * successful, unlock the entry so that someone else can pick it
+			 * up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+															 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
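The status decision made by the resolver loop above can be isolated into a small standalone sketch. This is not code from the patch: the enum, the struct and `sketch_decide()` are hypothetical stand-ins, and the three booleans take the place of the patch's calls to `TransactionIdDidCommit()`, `TransactionIdDidAbort()` and `TransactionIdIsInProgress()`.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the patch's FDW_XACT_* status values. */
typedef enum
{
	SKETCH_PREPARING,
	SKETCH_COMMITTING_PREPARED,
	SKETCH_ABORTING_PREPARED
} SketchStatus;

/* Hypothetical summary of what the local server knows about the xid. */
typedef struct
{
	bool		did_commit;
	bool		did_abort;
	bool		in_progress;
} SketchLocalState;

/*
 * Decide the fate of a prepared foreign transaction from the state of the
 * local transaction. Returns false when the entry has to be skipped because
 * the local transaction is still running (e.g. it is itself prepared).
 */
bool
sketch_decide(SketchStatus *status, SketchLocalState local)
{
	if (*status == SKETCH_COMMITTING_PREPARED ||
		*status == SKETCH_ABORTING_PREPARED)
		return true;			/* decision was already made earlier */

	if (local.did_commit)
		*status = SKETCH_COMMITTING_PREPARED;
	else if (local.did_abort)
		*status = SKETCH_ABORTING_PREPARED;
	else if (!local.in_progress)
		*status = SKETCH_ABORTING_PREPARED;	/* presumed crashed */
	else
		return false;			/* still in progress locally, skip */

	return true;
}
```

Keeping the decision separate from the locking and the remote call makes it easy to see that an entry is only ever skipped in one case: the local transaction is still in progress.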
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user who prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId xid;
+	Oid			dbid;
+	Oid			serverid;
+	Oid			userid;
+	List	   *entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+		DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact		fdw_xact = linitial(entries_to_remove);
+
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+	FDWXactOnDiskData *fdw_xact_file_data;
+	struct stat stat;
+	uint32		crc_offset;
+	pg_crc32c	calc_crc;
+	pg_crc32c	file_crc;
+	char	   *buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			   errmsg("could not open FDW transaction state file \"%s\": %m",
+					  path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not stat FDW transaction state file \"%s\": %m",
+					  path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errmsg("FDW transaction state file \"%s\" has invalid size",
+						path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *) buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not read FDW transaction state file \"%s\": %m",
+					  path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+		fdw_xact_file_data->userid != userid ||
+		fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+			(errmsg("removing corrupt foreign transaction state file \"%s\"",
+					path)));
+		/* fd was already closed above, only the buffer needs freeing */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
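The CRC check in ReadFDWXactFile() treats the last `sizeof(pg_crc32c)` bytes of the file as a checksum of everything before them. The following standalone sketch shows that layout under stated assumptions: `sketch_crc32c()` is a plain bitwise CRC-32C written for illustration, not PostgreSQL's `COMP_CRC32C` implementation, and `sketch_trailer_ok()` is a hypothetical helper.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
sketch_crc32c(const unsigned char *buf, size_t len)
{
	uint32_t	crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++)
	{
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/*
 * Return true when the trailing 4 bytes of buf hold the CRC of all the
 * bytes that precede them (stored in host byte order, as in this sketch).
 */
bool
sketch_trailer_ok(const unsigned char *buf, size_t len)
{
	uint32_t	stored;

	if (len < sizeof(stored))
		return false;			/* too short to contain a trailer */

	/* memcpy avoids reading through a possibly unaligned pointer */
	memcpy(&stored, buf + len - sizeof(stored), sizeof(stored));
	return stored == sketch_crc32c(buf, len - sizeof(stored));
}
```

The sketch also rejects buffers shorter than the trailer itself. The patch's lower bound, `offsetof(FDWXactOnDiskData, fdw_xact_id)`, does not appear to account for the CRC bytes, which might be worth a second look.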
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId nextXid = ShmemVariableCache->nextXid;
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding to an
+			 * XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+						(errmsg("removing future foreign prepared transaction file \"%s\"",
+								clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
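Both directory scans rely on the state file name encoding the local xid, server OID and user OID as three 8-digit upper-case hex fields. A minimal standalone sketch of that parsing follows. It assumes `FDW_XACT_FILE_NAME_LEN` equals `8 + 1 + 8 + 1 + 8`, which is inferred from the `"%08x_%08x_%08x"` format rather than shown in this part of the patch. The function names are hypothetical, and `sketch_xid_follows_or_equals()` only mimics the wraparound-aware comparison for normal XIDs.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed value: three 8-digit hex fields joined by underscores. */
#define SKETCH_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)

/* Validate a state file name and extract xid, server OID and user OID. */
bool
sketch_parse_file_name(const char *name, uint32_t *xid,
					   uint32_t *serverid, uint32_t *userid)
{
	unsigned int x;
	unsigned int s;
	unsigned int u;

	if (strlen(name) != SKETCH_FILE_NAME_LEN ||
		strspn(name, "0123456789ABCDEF_") != SKETCH_FILE_NAME_LEN)
		return false;

	/* Unlike the patch, also insist that all three fields were parsed. */
	if (sscanf(name, "%08x_%08x_%08x", &x, &s, &u) != 3)
		return false;

	*xid = x;
	*serverid = s;
	*userid = u;
	return true;
}

/* Modulo-2^32 "a >= b", valid for normal (non-special) XIDs only. */
bool
sketch_xid_follows_or_equals(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) >= 0;
}
```

Checking the `sscanf()` return value guards against names such as 26 underscores, which pass the `strspn()` test but carry no usable fields.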
+
+/*
+ * RecoverFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXacts(void)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+			FDWXactOnDiskData *fdw_xact_file_data;
+			FDWXact		fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction file \"%s\"",
+						  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+							local_xid, serverid, userid)));
+
+			fdw_xact = get_fdw_xact(local_xid, serverid, userid);
+
+			LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+			if (!fdw_xact)
+			{
+				/*
+				 * Add this entry into the table of foreign transactions. The
+				 * status of the transaction is set as preparing, since we do not
+				 * know the exact status right now. Resolver will set it later
+				 * based on the status of local transaction which prepared this
+				 * foreign transaction.
+				 */
+				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										   serverid, userid,
+										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id);
+				fdw_xact->locking_backend = MyBackendId;
+				fdw_xact->status = FDW_XACT_PREPARING;
+			}
+			else
+			{
+				Assert(fdw_xact->inredo);
+				fdw_xact->inredo = false;
+			}
+
+			/* Mark the entry as ready */
+			fdw_xact->valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			pfree(fdw_xact_file_data);
+			LWLockRelease(FDWXactLock);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove foreign transaction state file \"%s\": %m",
+							path)));
+}
+
+/*
+ * FDWXactRedoAdd
+ *
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdw_xact entry in shared memory FDWXactData structure.
+ */
+void
+FDWXactRedoAdd(XLogReaderState *record)
+{
+	FDWXactOnDiskData *fdw_xact_data = (FDWXactOnDiskData *) XLogRecGetData(record);
+	FDWXact fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
+							   fdw_xact_data->serverid, fdw_xact_data->userid,
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+	fdw_xact->inredo = true;
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * FDWXactRedoRemove
+ *
+ * Remove the corresponding fdw_xact entry from FDWXactGlobal.
+ * Also remove fdw_xact file if a foreign transaction was saved
+ * via an earlier checkpoint.
+ */
+void
+FDWXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	FDWXact	fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+	if (fdw_xact)
+	{
+		/* Now we can clean up any files we already left */
+		Assert(fdw_xact->inredo);
+		remove_fdw_xact(fdw_xact);
+	}
+	else
+	{
+		/*
+		 * Entry could be on disk. Call with giveWarning = false
+		 * since it can be expected during replay.
+		 */
+		RemoveFDWXactFile(xid, serverid, userid, false);
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..c10a027 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 83169cc..98f847b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -58,6 +58,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1455,6 +1456,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 5ca7375..d62a9b2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1985,6 +1986,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2143,6 +2147,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true);
 	AtCommit_ApplyLauncher();
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2232,6 +2237,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2620,6 +2628,7 @@ AbortTransaction(void)
 		AtEOXact_ComboCid();
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4313,6 +4322,10 @@ AbortOutOfAnyTransaction(void)
 void
 RegisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is nothing to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
 	XactWriteLocalNode = true;
 }
 
@@ -4322,6 +4335,10 @@ RegisterTransactionLocalNode(void)
 void
 UnregisterTransactionLocalNode(void)
 {
+	/* Quick exit if there is nothing to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
 	XactWriteLocalNode = false;
 }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5d58f09..f862369 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5104,6 +5105,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6176,6 +6178,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 	}
 }
 
@@ -6870,7 +6875,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7495,6 +7503,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7681,6 +7690,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXacts();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -8993,6 +9005,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9430,7 +9447,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9447,6 +9465,7 @@ XLogReportParameters(void)
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
+			xlrec.max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
@@ -9462,6 +9481,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9658,6 +9678,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9847,6 +9868,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 46c207c..2da7369 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/index.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index d357c8b..bf5fbc1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -301,6 +301,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 68100df..3c05676 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1093,6 +1094,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1403,6 +1418,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 5c13d26..5b09f1d 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 3e13394..cdf2d8d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -494,7 +494,7 @@ RegisterLWLockTranches(void)
 
 	if (LWLockTrancheArray == NULL)
 	{
-		LWLockTranchesAllocated = 64;
+		LWLockTranchesAllocated = 65;
 		LWLockTrancheArray = (char **)
 			MemoryContextAllocZero(TopMemoryContext,
 						  LWLockTranchesAllocated * sizeof(char *));
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..8e7028a 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+FDWXactLock					46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e9d561b..bab9a23 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2065,6 +2066,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8a93bdc..1be8858 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -118,6 +118,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use the atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 8dde1e8..f0fa78a 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -205,6 +205,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 2ea8931..f703e60 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -276,6 +276,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index bcb9ed9..739a475 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -585,6 +585,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -797,6 +798,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..41eed51 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -8,6 +8,7 @@
 #define FRONTEND 1
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..0b470b4
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,75 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
+#define FDWXactId(path, prefix, xid, serverid, userid)	\
+	snprintf((path), FDW_XACT_ID_LEN + 1, "%s_%08X_%08X_%08X", (prefix), \
+			 (xid), (serverid), (userid))
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;
+	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+}	FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId xid;
+	Oid			serverid;
+	Oid			userid;
+	Oid			dbid;
+}	FdwRemoveXlogRec;
+
+extern int	max_prepared_foreign_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+				Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void FDWXactRedoAdd(XLogReaderState *record);
+extern void FDWXactRedoRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif   /* FDW_XACT_H */
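To see what the `FDWXactId()` macro in this header produces, here is a small standalone sketch of the same `snprintf()` call. The function name `sketch_make_id()` and the two-character prefix `"px"` used in the note below are assumptions for illustration; the header only fixes the length budget of 2 + 1 + 8 + 1 + 8 + 1 + 8 = 29 characters.

```c
#include <stdint.h>
#include <stdio.h>

/* Same arithmetic as FDW_XACT_ID_LEN: prefix, xid, serverid, userid. */
#define SKETCH_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

/*
 * Format a prepared-transaction identifier into out, which must have room
 * for SKETCH_ID_LEN + 1 bytes. Returns what snprintf() returns, i.e. the
 * length the full string would have had without truncation.
 */
int
sketch_make_id(char *out, const char *prefix, uint32_t xid,
			   uint32_t serverid, uint32_t userid)
{
	return snprintf(out, SKETCH_ID_LEN + 1, "%s_%08X_%08X_%08X", prefix,
					(unsigned int) xid, (unsigned int) serverid,
					(unsigned int) userid);
}
```

With the prefix `"px"`, xid 10, server OID 0x4001 and user OID 10, the buffer holds `px_0000000A_00004001_0000000A`. A prefix longer than two characters is silently truncated at the tail, so the last digits of the user OID would be lost. A caller could detect that by comparing the return value against the length budget.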
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 2f43c19..62702de 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index aee1a07..f30374d 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -93,6 +93,9 @@ extern int  MyXactFlags;
 #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK	(1U << 1)
 
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index c09c0f8..be6a412 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -213,6 +213,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3a25cc8..c57a66f 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -182,6 +182,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 79f9b90..27f0adb 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5324,6 +5324,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4130 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4131 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4132 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..7b95f77 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,18 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 														   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															char *prep_info);
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 													  ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -220,6 +233,11 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1a125d8..f59ecbb 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -266,11 +266,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 /* configurable options */
 extern int	DeadlockTimeout;
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 1435a7b..843c629 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -121,4 +121,8 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
 #endif   /* BUILTINS_H */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index d706f42..06102ff 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index b685aeb..478260b 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2263,9 +2263,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable
+		 * testing of atomic foreign transactions. (Note: to reduce the
+		 * probability of unexpected shmmax failures, don't set
+		 * max_prepared_transactions or max_prepared_foreign_transactions any
+		 * higher than actually needed by the corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2280,7 +2282,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

#130

Masahiko Sawada

sawada.mshk@gmail.com

almost 9 years ago

In reply to: Masahiko Sawada (#129)

Re: Transactions involving multiple postgres foreign servers

On Wed, Mar 29, 2017 at 11:14 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Wed, Mar 22, 2017 at 2:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Mar 16, 2017 at 2:37 PM, Vinayak Pokale
<pokale_vinayak_q3@lab.ntt.co.jp> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

I have tested the latest patch and it looks good to me,
so I marked it "Ready for committer".
Anyway, it would be great if anyone could also have a look at the patches and send comments.

The new status of this patch is: Ready for Committer

Thank you for the update, but I found a bug in the 001 patch. Attached are the latest patches.
The differences are
* Fixed a bug.
* Ran pgindent.
* Separated the patch supporting GetPrepareID API.

Since previous patches conflict with current HEAD, I attached latest
set of patches.

Vinayak, why did you mark this patch as "Move to next CF"? AFAIU
there has been no discussion yet.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#131

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#130)

Re: Transactions involving multiple postgres foreign servers

On Fri, Apr 7, 2017 at 10:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Vinayak, why did you mark this patch as "Move to next CF"? AFAIU
there has been no discussion yet.

I'd like to discuss this patch. Clearly, a lot of work has been done
here, but I am not sure about the approach.

If we were to commit this patch set, then you could optionally enable
two_phase_commit for a postgres_fdw foreign server. If you did, then,
modulo bugs and administrator shenanigans, and given proper
configuration, you would be guaranteed that a successful commit of a
transaction which touched postgres_fdw foreign tables would eventually
end up committed or rolled back on all of the nodes, rather than
committed on some and rolled back on others. However, you would not
be guaranteed that all of those commits or rollbacks happen at
anything like the same time. So, you would have a sort of eventual
consistency. Any given snapshot might not be consistent, but if you
waited long enough and with all nodes online, eventually all
distributed transactions would be resolved in a consistent manner.
That's kinda cool, but I think what people really want is a stronger
guarantee, namely, that they will get consistent snapshots. It's not
clear to me that this patch gets us any closer to that goal. Does
anyone have a plan for how we'd get from here to that stronger goal?
If not, is the patch useful enough to justify committing it for what
it can already do? It would be particularly good to hear some
end-user views on this functionality and whether or not they would use
it and find it valuable.

On a technical level, I am pretty sure that it is not OK to call
AtEOXact_FDWXacts() from the sections of CommitTransaction,
AbortTransaction, and PrepareTransaction that are described as
"non-critical resource releasing". At that point, it's too late to
throw an error, and it is very difficult to imagine something that
involves a TCP connection to another machine not being subject to
error. You might say "well, we can just make sure that any problems
are reported as a WARNING rather than an ERROR", but that's pretty
hard to guarantee; most backend code assumes it can ERROR, so anything
you call is a potential hazard. There is a second problem, too: any
code that runs from here is not interruptible. The user can hit ^C
all day and nothing will happen. That's a bad situation when you're
busy doing network I/O. I'm not exactly sure what the best thing to
do about this problem would be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#132

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#131)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 27, 2017 at 10:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Apr 7, 2017 at 10:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Vinayak, why did you mark this patch as "Move to next CF"? AFAIU
there has been no discussion yet.

I'd like to discuss this patch. Clearly, a lot of work has been done
here, but I am not sure about the approach.

Thank you for the comment. I'd like to reply about the goal of this
feature first.

If we were to commit this patch set, then you could optionally enable
two_phase_commit for a postgres_fdw foreign server. If you did, then,
modulo bugs and administrator shenanigans, and given proper
configuration, you would be guaranteed that a successful commit of a
transaction which touched postgres_fdw foreign tables would eventually
end up committed or rolled back on all of the nodes, rather than
committed on some and rolled back on others. However, you would not
be guaranteed that all of those commits or rollbacks happen at
anything like the same time. So, you would have a sort of eventual
consistency. Any given snapshot might not be consistent, but if you
waited long enough and with all nodes online, eventually all
distributed transactions would be resolved in a consistent manner.
That's kinda cool, but I think what people really want is a stronger
guarantee, namely, that they will get consistent snapshots. It's not
clear to me that this patch gets us any closer to that goal. Does
anyone have a plan for how we'd get from here to that stronger goal?
If not, is the patch useful enough to justify committing it for what
it can already do? It would be particularly good to hear some
end-user views on this functionality and whether or not they would use
it and find it valuable.

Yeah, this patch only guarantees that if you got a commit, the
transaction was either committed or rolled back on all relevant nodes,
and that subsequent transactions can see a consistent result (if a
server failed, we have to recover in-doubt transactions properly after
the crash). It doesn't guarantee that a concurrent transaction can see
a consistent result. To provide a cluster-wide consistent result, I
think we need a transaction manager for distributed queries that is
responsible for providing consistent snapshots. There has been some
discussion about what type of transaction manager to use, but either
way we need a new one for distributed queries. I think providing a
consistent result to concurrent transactions and atomically committing
or rolling back a transaction are separate features and should be
discussed separately. A consistent snapshot would not be useful, and
users would complain, if a distributed transaction could still commit
on only some of the nodes. So this patch could also be an important
building block for providing consistent results.

On a technical level, I am pretty sure that it is not OK to call
AtEOXact_FDWXacts() from the sections of CommitTransaction,
AbortTransaction, and PrepareTransaction that are described as
"non-critical resource releasing". At that point, it's too late to
throw an error, and it is very difficult to imagine something that
involves a TCP connection to another machine not being subject to
error. You might say "well, we can just make sure that any problems
are reported as a WARNING rather than an ERROR", but that's pretty
hard to guarantee; most backend code assumes it can ERROR, so anything
you call is a potential hazard. There is a second problem, too: any
code that runs from here is not interruptible. The user can hit ^C
all day and nothing will happen. That's a bad situation when you're
busy doing network I/O. I'm not exactly sure what the best thing to
do about this problem would be.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#133

Stas Kelvich

s.kelvich@postgrespro.ru

over 8 years ago

In reply to: Robert Haas (#131)

Re: Transactions involving multiple postgres foreign servers

On 27 Jul 2017, at 04:28, Robert Haas <robertmhaas@gmail.com> wrote:

However, you would not
be guaranteed that all of those commits or rollbacks happen at
anything like the same time. So, you would have a sort of eventual
consistency.

As far as I understand, any solution that provides proper isolation for distributed
transactions in postgres will require distributed 2PC somewhere under the hood.
That is just a consequence of the parallelism that the database allows: transactions
can abort due to concurrent operations. So the dichotomy is simple: either we need
2PC, or we restrict write transactions to be physically serial.

In particular, both Postgres-XL/XC and postgrespro multimaster use 2PC to
commit distributed transactions.

Some years ago we created patches to implement a transaction manager API. That API
is just a way to inject consistent snapshots on different nodes; atomic commit itself
is out of scope of the TM API (hmm, maybe it would be better to call it a snapshot
manager API?). That allows us to use it in quite different environments, such as fdw
and logical replication, both of which use 2PC.

I want to submit the TM API again during this release cycle, along with an
implementation for fdw, and I plan to base it on top of this patch. I have already
rebased Masahiko’s patch onto current master and started writing a long list of
nitpicks, but it is not finished yet.

Also, I see quite a lot of value in this patch even without tm/snapshots/whatever.
Right now fdw guarantees neither isolation nor atomicity. If one isn’t doing
cross-node analytical transactions, it is safe to live without isolation.
But living without atomicity means that some parts of the data can be lost, with no
simple way to detect and fix that.

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#134

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Stas Kelvich (#133)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 27, 2017 at 5:08 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

As far as I understand, any solution that provides proper isolation for distributed
transactions in postgres will require distributed 2PC somewhere under the hood.
That is just a consequence of the parallelism that the database allows: transactions
can abort due to concurrent operations. So the dichotomy is simple: either we need
2PC, or we restrict write transactions to be physically serial.

In particular, both Postgres-XL/XC and postgrespro multimaster use 2PC to
commit distributed transactions.

Ah, OK. I was imagining that a transaction manager might be
responsible for managing both snapshots and distributed commit. But
if the transaction manager only handles the snapshots (how?) and the
commit has to be done using 2PC, then we need this.

Also, I see quite a lot of value in this patch even without tm/snapshots/whatever.
Right now fdw guarantees neither isolation nor atomicity. If one isn’t doing
cross-node analytical transactions, it is safe to live without isolation.
But living without atomicity means that some parts of the data can be lost, with no
simple way to detect and fix that.

OK, thanks for weighing in.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#135

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 8 years ago

In reply to: Robert Haas (#131)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 27, 2017 at 6:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On a technical level, I am pretty sure that it is not OK to call
AtEOXact_FDWXacts() from the sections of CommitTransaction,
AbortTransaction, and PrepareTransaction that are described as
"non-critical resource releasing". At that point, it's too late to
throw an error, and it is very difficult to imagine something that
involves a TCP connection to another machine not being subject to
error. You might say "well, we can just make sure that any problems
are reported as a WARNING rather than an ERROR", but that's pretty
hard to guarantee; most backend code assumes it can ERROR, so anything
you call is a potential hazard. There is a second problem, too: any
code that runs from here is not interruptible. The user can hit ^C
all day and nothing will happen. That's a bad situation when you're
busy doing network I/O. I'm not exactly sure what the best thing to
do about this problem would be.

The remote transaction can be committed/aborted only after the fate of
the local transaction is decided. If we commit the remote transaction
and abort the local transaction, that's not good. AtEOXact* functions
are called immediately after that decision, in the post-commit/abort
phase. So, if we want to commit/abort the remote transaction
immediately, it has to be done in post-commit/abort processing. If we
instead delegate that to the remote transaction resolver backend
(introduced by the patches), the delay between the local commit and the
remote commits depends on when the resolver gets a chance to run and
process those transactions. One could argue that this delay would exist
anyway when post-commit/abort processing fails to resolve the remote
transaction. But given the high availability of systems these days, in
most cases the remote transaction will be resolved in the
post-commit/abort phase. I think we should optimize for the most common
case. Your concern is still valid: we shouldn't raise an error or do
anything critical in the post-commit/abort phase. So we should devise a
way to send COMMIT/ABORT prepared messages to the remote server
asynchronously, carefully avoiding errors. Recent changes to 2PC have
improved performance in that area to a great extent. Relying on the
resolver backend to resolve remote transactions would erode that
performance gain.
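
To make the ordering constraint above concrete, here is a minimal,
self-contained sketch in C. It is not code from the patch: the
Participant type, the PartState values and run_two_phase_commit() are
hypothetical names used only for illustration, and the outcome of each
remote PREPARE is simulated with a flag instead of a network call. The
point it shows is that remote transactions are resolved only after the
commit/abort decision has been made.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical participant states, for illustration only. */
typedef enum
{
	PART_ACTIVE,
	PART_PREPARED,
	PART_COMMITTED,
	PART_ABORTED
} PartState;

typedef struct
{
	int			prepare_will_succeed;	/* simulates the outcome of PREPARE */
	PartState	state;
} Participant;

/*
 * Run a simulated two-phase commit over all participants.
 * Returns 1 if the distributed transaction committed, 0 if it aborted.
 */
int
run_two_phase_commit(Participant *parts, size_t nparts)
{
	size_t		i;
	int			commit = 1;

	assert(parts != NULL || nparts == 0);

	/* Phase 1: prepare everywhere; stop at the first failure. */
	for (i = 0; i < nparts; i++)
	{
		if (!parts[i].prepare_will_succeed)
		{
			commit = 0;
			break;
		}
		parts[i].state = PART_PREPARED;
	}

	/* The fate of the local transaction is decided here ("commit"). */

	/* Phase 2: only now are the remote transactions resolved. */
	for (i = 0; i < nparts; i++)
		parts[i].state = commit ? PART_COMMITTED : PART_ABORTED;

	return commit;
}
```

In a real backend, phase 2 is where the discussion above applies: it
runs after the local decision, so it must not raise errors, whether it
is done in the foreground or handed to a resolver.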

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#136

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#134)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 27, 2017 at 8:02 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 27, 2017 at 5:08 AM, Stas Kelvich <s.kelvich@postgrespro.ru> wrote:

As far as I understand, any solution that provides proper isolation for distributed
transactions in postgres will require distributed 2PC somewhere under the hood.
That is just a consequence of the parallelism that the database allows: transactions
can abort due to concurrent operations. So the dichotomy is simple: either we need
2PC, or we restrict write transactions to be physically serial.

In particular, both Postgres-XL/XC and postgrespro multimaster use 2PC to
commit distributed transactions.

Ah, OK. I was imagining that a transaction manager might be
responsible for managing both snapshots and distributed commit. But
if the transaction manager only handles the snapshots (how?) and the
commit has to be done using 2PC, then we need this.

One way to provide snapshots to participant nodes is to send snapshot
data to them over the libpq protocol, along with the query, when the
coordinator node starts a transaction on a remote node (or can we now
use the snapshot export infrastructure?). IIUC Postgres-XL/XC uses this
approach. That also requires sharing the same XID space with all remote
nodes. Perhaps CSN-based snapshots can make this simpler.

Also, I see quite a lot of value in this patch even without tm/snapshots/whatever.
Right now fdw guarantees neither isolation nor atomicity. If one isn’t doing
cross-node analytical transactions, it is safe to live without isolation.
But living without atomicity means that some parts of the data can be lost, with no
simple way to detect and fix that.

OK, thanks for weighing in.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#137

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#136)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jul 28, 2017 at 7:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

That also requires sharing the same XID space with all remote nodes.

You are putting your finger on the main bottleneck with global
consistency that XC and XL have because of that. And the source feeding
the XIDs is a SPOF.

Perhaps CSN-based snapshots can make this simpler.

Hm. This needs a closer look.
--
Michael


#138

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Ashutosh Bapat (#135)

Re: Transactions involving multiple postgres foreign servers

On Thu, Jul 27, 2017 at 8:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The remote transaction can be committed/aborted only after the fate of
the local transaction is decided. If we commit the remote transaction
and abort the local transaction, that's not good. AtEOXact* functions
are called immediately after that decision, in the post-commit/abort
phase. So, if we want to commit/abort the remote transaction
immediately, it has to be done in post-commit/abort processing. If we
instead delegate that to the remote transaction resolver backend
(introduced by the patches), the delay between the local commit and the
remote commits depends on when the resolver gets a chance to run and
process those transactions. One could argue that this delay would exist
anyway when post-commit/abort processing fails to resolve the remote
transaction. But given the high availability of systems these days, in
most cases the remote transaction will be resolved in the
post-commit/abort phase. I think we should optimize for the most common
case. Your concern is still valid: we shouldn't raise an error or do
anything critical in the post-commit/abort phase. So we should devise a
way to send COMMIT/ABORT prepared messages to the remote server
asynchronously, carefully avoiding errors. Recent changes to 2PC have
improved performance in that area to a great extent. Relying on the
resolver backend to resolve remote transactions would erode that
performance gain.

I think there are two separate but interconnected issues here. One is
that if we give the user a new command prompt without resolving the
remote transaction, then they might run a new query that sees their
own work as committed, which would be bad. Or, they might commit,
wait for the acknowledgement, and then tell some other session to go
look at the data, and find it not there. That would also be bad. I
think the solution is likely to do something like what we did for
synchronous replication in commit
9a56dc3389b9470031e9ef8e45c95a680982e01a -- wait for the remote
transaction to be resolved (by the background process) but allow an
interrupt to escape the wait-loop.

The second issue is that having the resolver resolve transactions
might be slower than doing it in the foreground. I don't necessarily
see a reason why that should be a big problem. I mean, the resolver
might need to establish a separate connection, but if it keeps that
connection open for a while (say, 5 minutes) in case further
transactions arrive then it won't be an issue except on a really
low-volume system, which isn't really a case I think we need to worry
about very much. Also, the hand-off to the resolver might take some
time, but that's equally true for sync rep and we're living with it
there. Anything else is presumably just the resolver itself being
inefficient which seems like something that can simply be fixed.

FWIW, I don't think the present resolver implementation is likely to
be what we want. IIRC, it's just calling an SQL function which
doesn't seem like a good approach. Ideally we should stick an entry
into a shared memory queue and then ping the resolver via SetLatch,
and it can directly invoke an FDW method on the data from the shared
memory queue. It should be possible to set things up so that a user
who wishes to do so can run multiple copies of the resolver thread at
the same time, which would be a good way to keep latency down if the
system is very busy with distributed transactions.
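
The hand-off described above could look roughly like the following
sketch. It is only an illustration under simplifying assumptions: the
queue is an ordinary in-process ring buffer rather than real shared
memory, the latch_set flag stands in for SetLatch, and ResolveRequest,
ResolveQueue, resolve_queue_push(), resolver_drain() and
resolve_always_ok() are hypothetical names, not part of the patch or of
PostgreSQL. The callback plays the role of the FDW method that the
resolver would invoke directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RESOLVE_QUEUE_SIZE 8

/* One pending foreign transaction (hypothetical layout). */
typedef struct
{
	uint32_t	xid;
	uint32_t	serverid;
	uint32_t	userid;
	int			is_commit;
} ResolveRequest;

typedef struct
{
	ResolveRequest items[RESOLVE_QUEUE_SIZE];
	size_t		head;			/* next request to resolve */
	size_t		tail;			/* next free slot */
	size_t		count;
	int			latch_set;		/* stands in for the resolver's latch */
} ResolveQueue;

/* Stand-in for an FDW resolve method; returns 1 on success. */
typedef int (*ResolveCallback) (const ResolveRequest *req);

/* Backend side: enqueue a request and wake the resolver. */
int
resolve_queue_push(ResolveQueue *q, ResolveRequest req)
{
	assert(q != NULL);
	if (q->count == RESOLVE_QUEUE_SIZE)
		return 0;				/* queue full, caller must retry */
	q->items[q->tail] = req;
	q->tail = (q->tail + 1) % RESOLVE_QUEUE_SIZE;
	q->count++;
	q->latch_set = 1;
	return 1;
}

/* Resolver side: process queued requests; returns how many succeeded. */
size_t
resolver_drain(ResolveQueue *q, ResolveCallback resolve)
{
	size_t		done = 0;

	assert(q != NULL && resolve != NULL);
	q->latch_set = 0;
	while (q->count > 0)
	{
		if (!resolve(&q->items[q->head]))
			break;				/* leave the request queued for a retry */
		q->head = (q->head + 1) % RESOLVE_QUEUE_SIZE;
		q->count--;
		done++;
	}
	return done;
}

/* Trivial callback that treats every request as resolved. */
int
resolve_always_ok(const ResolveRequest *req)
{
	return req != NULL;
}
```

A real implementation would also need locking around the queue and a
way for the waiting backend to learn that its request was resolved;
both are left out here.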

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#139

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Michael Paquier (#137)

Re: Transactions involving multiple postgres foreign servers

On Fri, Jul 28, 2017 at 10:14 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Fri, Jul 28, 2017 at 7:28 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

That also requires sharing the same XID space with all remote nodes.

You are putting your finger on the main bottleneck with global
consistency that XC and XL have because of that. And the source feeding
the XIDs is a SPOF.

Perhaps CSN-based snapshots can make this simpler.

Hm. This needs a closer look.

With or without CSNs, sharing the same XID space across all nodes is
undesirable in a loosely-coupled network. If only a small fraction of
transactions are distributed, incurring the overhead of synchronizing
XID assignment for every transaction is not good. Suppose node A
processes many transactions and node B only a few transactions; then,
XID advancement caused by node A forces node B to perform vacuum for
wraparound. Not fun. Or, if you have an OLTP workload running on A
and an OLTP workload running on B that are independent of each other, and
occasional reporting queries that scan both, you'll be incurring the
overhead of keeping A and B consistent for a lot of transactions that
don't need it. Of course, when A and B are tightly coupled and
basically all transactions are scanning both, locking the XID space
together *may* be the best approach, but even then there are notable
disadvantages - e.g. they can't both continue processing write
transactions if the connection between the two is severed.

An alternative approach is to have some kind of other identifier,
let's call it a distributed transaction ID (DXID) which is mapped by
each node onto a local XID.
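
A minimal sketch of the DXID idea, in Python for brevity. The `Node` class, the `join` method, and the `"A:42"` DXID format are hypothetical names invented for illustration; nothing here corresponds to existing PostgreSQL code. It only shows that each node keeps consuming its own XIDs at its own pace and allocates a local XID for a distributed transaction the first time that transaction touches it.

```python
import itertools


class Node:
    """Hypothetical node that maps distributed transaction IDs to local XIDs."""

    def __init__(self, name, first_xid):
        self.name = name
        self._next_xid = itertools.count(first_xid)
        self.dxid_to_xid = {}   # DXID -> local XID on this node

    def begin_local(self):
        # Purely local transaction: consumes a local XID, needs no DXID.
        return next(self._next_xid)

    def join(self, dxid):
        # First contact allocates a local XID; later calls reuse the mapping.
        if dxid not in self.dxid_to_xid:
            self.dxid_to_xid[dxid] = next(self._next_xid)
        return self.dxid_to_xid[dxid]


a = Node("A", first_xid=5000)
b = Node("B", first_xid=10)
for _ in range(1000):
    a.begin_local()         # busy node A burns through its own XIDs

dxid = "A:42"               # assumed format: origin node plus a counter
xid_on_a = a.join(dxid)
xid_on_b = b.join(dxid)
```

In this sketch node B's XID counter advances only for work that actually reaches B, so A's traffic does not push B toward wraparound.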

Regardless of whether we share XIDs or DXIDs, we need a more complex
concept of transaction state than we have now. Right now,
transactions are basically in-progress, committed, or aborted, but
there's also the state where the status of the transaction is known by
someone but not locally. You can imagine something like: during the
prepare phase, all nodes set the status in clog to "prepared". Then,
if that succeeds, the leader changes the status to "committed" or
"aborted" and tells all nodes to do the same. Thereafter, any time
someone inquires about the status of that transaction, we go ask all
of the other nodes in the cluster; if they all think it's prepared,
then it's prepared -- but if any of them think it's committed or
aborted, then we change our local status to match and return that
status. So once one node changes the status to committed or aborted
it can propagate through the cluster even if connectivity is lost for
a while.
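
A rough sketch of that status lookup, in Python for brevity. The `Node` class and its methods are hypothetical, and skipping unreachable peers is an assumption of the sketch rather than something settled in this thread.

```python
PREPARED, COMMITTED, ABORTED = "prepared", "committed", "aborted"


class Node:
    """Hypothetical cluster node holding its own view of transaction status."""

    def __init__(self, name):
        self.name = name
        self.status = {}        # dxid -> status as recorded locally
        self.peers = []         # the other nodes in the cluster
        self.reachable = True   # False simulates a severed connection

    def local_status(self, dxid):
        # What a peer sees when it asks this node over the network.
        if not self.reachable:
            raise ConnectionError(self.name)
        return self.status[dxid]

    def resolve_status(self, dxid):
        mine = self.status[dxid]
        if mine != PREPARED:
            return mine                     # already final locally
        for peer in self.peers:
            try:
                theirs = peer.local_status(dxid)
            except ConnectionError:
                continue                    # assumption: skip unreachable peers
            if theirs in (COMMITTED, ABORTED):
                self.status[dxid] = theirs  # adopt the final status
                return theirs
        return PREPARED                     # nobody reachable knows the outcome


a, b, c = Node("A"), Node("B"), Node("C")
for node in (a, b, c):
    node.peers = [p for p in (a, b, c) if p is not node]
    node.status["dx1"] = PREPARED

# Leader A decides to commit and manages to tell only B before it goes away.
a.status["dx1"] = COMMITTED
b.status["dx1"] = COMMITTED
a.reachable = False

before = c.status["dx1"]
answer = c.resolve_status("dx1")
```

Here node C never hears from the leader, yet it still learns the outcome from B the first time somebody asks about the transaction.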

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#140

Alvaro Herrera

alvherre@2ndquadrant.com

over 8 years ago

In reply to: Robert Haas (#139)

Re: Transactions involving multiple postgres foreign servers

Robert Haas wrote:

An alternative approach is to have some kind of other identifier,
let's call it a distributed transaction ID (DXID) which is mapped by
each node onto a local XID.

Postgres-XL seems to manage this problem by using a transaction manager
node, which is in charge of assigning snapshots. I don't know how that
works, but perhaps adding that concept here could be useful too. One
critical point to that design is that the app connects not directly to
the underlying Postgres server but instead to some other node which is
or connects to the node that manages the snapshots.

Maybe Michael can explain in better detail how it works, and/or how (and
if) it could be applied here.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


#141

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Alvaro Herrera (#140)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jul 31, 2017 at 1:27 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Postgres-XL seems to manage this problem by using a transaction manager
node, which is in charge of assigning snapshots. I don't know how that
works, but perhaps adding that concept here could be useful too. One
critical point to that design is that the app connects not directly to
the underlying Postgres server but instead to some other node which is
or connects to the node that manages the snapshots.

Maybe Michael can explain in better detail how it works, and/or how (and
if) it could be applied here.

I suspect that if you've got a central coordinator server that is the
jumping-off point for all distributed transactions, the Postgres-XL
approach is hard to beat (at least in concept, not sure about the
implementation). That server is brokering all of the connections to
the data nodes anyway, so it might as well tell them all what
snapshots to use while it's there. When you scale to multiple
coordinators, though, it's less clear that it's the best approach.
Now one coordinator has to be the GTM master, and that server is
likely to become a bottleneck -- plus talking to it involves extra
network hops for all the other coordinators. When you then move the
needle a bit further and imagine a system where the idea of a
coordinator doesn't even exist, and you've just got a loosely coupled
distributed system where distributed transactions might arrive on any
node, all of which are also servicing local transactions, then it
seems pretty likely that the Postgres-XL approach is not the best fit.

We might want to support multiple models. Which one to support first
is a harder question. The thing I like least about the Postgres-XC
approach is it seems inevitable that, as Michael says, the central
server handing out XIDs and snapshots is bound to become a bottleneck.
That type of system implicitly constructs a total order of all
distributed transactions, but we don't really need a total order. If
two transactions don't touch the same data and there's no overlapping
transaction that can notice the commit order, then we could make those
commit decisions independently on different nodes without caring which
one "happens first". The problem is that it might take so much
bookkeeping to figure out whether that is in fact the case in a
particular instance that it's even more expensive than having a
central server that functions as a global bottleneck.

It might be worth some study not only of Postgres-XL but also of other
databases that claim to provide distributed transactional consistency
across nodes. I've found literature on this topic from time to time
over the years, but I'm not sure what the best practices in this area
actually are. https://en.wikipedia.org/wiki/Global_serializability
claims that a technique called Commitment Ordering (CO) is teh
awesome, but I've got my doubts about whether that's really an
objective description of the state of the art. One clue is that the
global serializability article says three separate times that the
technique has been widely misunderstood. I'm not sure exactly which
Wikipedia guideline that violates, but I think Wikipedia is supposed
to summarize the views that exist on a topic in accordance with their
prevalence, not take a position on which view is correct.
https://en.wikipedia.org/wiki/Commitment_ordering contains citations
from the papers only of one guy, Yoav Raz, which is another hint that
this may not be as widely-regarded a technique as the person who wrote
these articles thinks it should be. Anyway, it would be good to
understand what other well-regarded systems do before we choose what
we want to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#142

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#141)

Re: Transactions involving multiple postgres foreign servers

On Tue, Aug 1, 2017 at 3:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jul 31, 2017 at 1:27 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Postgres-XL seems to manage this problem by using a transaction manager
node, which is in charge of assigning snapshots. I don't know how that
works, but perhaps adding that concept here could be useful too. One
critical point to that design is that the app connects not directly to
the underlying Postgres server but instead to some other node which is
or connects to the node that manages the snapshots.

Maybe Michael can explain in better detail how it works, and/or how (and
if) it could be applied here.

I suspect that if you've got a central coordinator server that is the
jumping-off point for all distributed transactions, the Postgres-XL
approach is hard to beat (at least in concept, not sure about the
implementation). That server is brokering all of the connections to
the data nodes anyway, so it might as well tell them all what
snapshots to use while it's there. When you scale to multiple
coordinators, though, it's less clear that it's the best approach.
Now one coordinator has to be the GTM master, and that server is
likely to become a bottleneck -- plus talking to it involves extra
network hops for all the other coordinators. When you then move the
needle a bit further and imagine a system where the idea of a
coordinator doesn't even exist, and you've just got a loosely coupled
distributed system where distributed transactions might arrive on any
node, all of which are also servicing local transactions, then it
seems pretty likely that the Postgres-XL approach is not the best fit.

We might want to support multiple models. Which one to support first
is a harder question. The thing I like least about the Postgres-XC
approach is it seems inevitable that, as Michael says, the central
server handing out XIDs and snapshots is bound to become a bottleneck.
That type of system implicitly constructs a total order of all
distributed transactions, but we don't really need a total order. If
two transactions don't touch the same data and there's no overlapping
transaction that can notice the commit order, then we could make those
commit decisions independently on different nodes without caring which
one "happens first". The problem is that it might take so much
bookkeeping to figure out whether that is in fact the case in a
particular instance that it's even more expensive than having a
central server that functions as a global bottleneck.

It might be worth some study not only of Postgres-XL but also of other
databases that claim to provide distributed transactional consistency
across nodes. I've found literature on this topic from time to time
over the years, but I'm not sure what the best practices in this area
actually are.

Yeah, it's worth studying other databases and considering which approach
fits well with the PostgreSQL architecture. I've read some papers
related to distributed transaction management but I'm also not sure
what the best practices in this area are. However, one trend I've seen
is that some cloud-native databases such as Google Spanner[1] and
CockroachDB employ techniques that use timestamps to determine
visibility without centralized coordination. Google Spanner uses GPS
clocks and atomic clocks, but since these are not common hardware,
CockroachDB uses local timestamps with NTP instead. Other transaction
techniques using local timestamps have also been discussed. For
example, Clock-SI[2] derives snapshots and commit timestamps from
loosely synchronized physical clocks, though it doesn't support the
serializable isolation level. IIUC the postgrespro multi-master cluster
employs a technique based on that. I've not read it deeply yet, but last
week I found a new paper[3] which introduces a new SI mechanism that
allows transactions to determine their timestamps autonomously,
without relying on centralized coordination. PostgreSQL currently uses
XIDs to determine visibility, but mapping each XID to its timestamp
using the commit timestamp feature might allow PostgreSQL to use
timestamps for that purpose.

https://en.wikipedia.org/wiki/Global_serializability
claims that a technique called Commitment Ordering (CO) is teh
awesome, but I've got my doubts about whether that's really an
objective description of the state of the art. One clue is that the
global serializability article says three separate times that the
technique has been widely misunderstood. I'm not sure exactly which
Wikipedia guideline that violates, but I think Wikipedia is supposed
to summarize the views that exist on a topic in accordance with their
prevalence, not take a position on which view is correct.
https://en.wikipedia.org/wiki/Commitment_ordering contains citations
from the papers only of one guy, Yoav Raz, which is another hint that
this may not be as widely-regarded a technique as the person who wrote
these articles thinks it should be. Anyway, it would be good to
understand what other well-regarded systems do before we choose what
we want to do.

[1]: https://research.google.com/archive/spanner.html
[2]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
[3]: https://arxiv.org/pdf/1704.01355.pdf

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#143

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#138)

Re: Transactions involving multiple postgres foreign servers

On Tue, Aug 1, 2017 at 1:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jul 27, 2017 at 8:25 AM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

The remote transaction can be committed/aborted only after the fate of
the local transaction is decided. If we commit remote transaction and
abort local transaction, that's not good. AtEOXact* functions are
called immediately after that decision in post-commit/abort phase. So,
if we want to commit/abort the remote transaction immediately it has
to be done in post-commit/abort processing. Instead if we delegate
that to the remote transaction resolver backend (introduced by the
patches) the delay between local commit and remote commits depends
upon when the resolver gets a chance to run and process those
transactions. One could argue that that delay would anyway exist when
post-commit/abort processing fails to resolve remote transaction. But
given the real high availability these days, in most of the cases
remote transaction will be resolved in the post-commit/abort phase. I
think we should optimize for most common case. Your concern is still
valid, that we shouldn't raise an error or do anything critical in
post-commit/abort phase. So we should devise a way to send
COMMIT/ABORT prepared messages to the remote server in asynchronous
fashion carefully avoiding errors. Recent changes to 2PC have improved
performance in that area to a great extent. Relying on resolver
backend to resolve remote transactions would erode that performance
gain.

I think there are two separate but interconnected issues here. One is
that if we give the user a new command prompt without resolving the
remote transaction, then they might run a new query that sees their
own work as committed, which would be bad. Or, they might commit,
wait for the acknowledgement, and then tell some other session to go
look at the data, and find it not there. That would also be bad. I
think the solution is likely to do something like what we did for
synchronous replication in commit
9a56dc3389b9470031e9ef8e45c95a680982e01a -- wait for the remote
transaction to be resolved (by the background process) but allow an
interrupt to escape the wait-loop.

The second issue is that having the resolver resolve transactions
might be slower than doing it in the foreground. I don't necessarily
see a reason why that should be a big problem. I mean, the resolver
might need to establish a separate connection, but if it keeps that
connection open for a while (say, 5 minutes) in case further
transactions arrive then it won't be an issue except on really
low-volume system which isn't really a case I think we need to worry
about very much. Also, the hand-off to the resolver might take some
time, but that's equally true for sync rep and we're living with it
there. Anything else is presumably just the resolver itself being
inefficient which seems like something that can simply be fixed.

I think using a solution similar to sync rep to wait for the
transaction to be resolved is a good approach. One concern I have is that
if we have one resolver process per backend process, switching
connections between participant nodes would add overhead. In the current
implementation the backend process uses cached connections to the
remote servers. On the other hand, if we have one resolver process per
database on the remote server, the backend process has to communicate
with multiple resolver processes.

FWIW, I don't think the present resolver implementation is likely to
be what we want. IIRC, it's just calling an SQL function which
doesn't seem like a good approach. Ideally we should stick an entry
into a shared memory queue and then ping the resolver via SetLatch,
and it can directly invoke an FDW method on the data from the shared
memory queue. It should be possible to set things up so that a user
who wishes to do so can run multiple copies of the resolver thread at
the same time, which would be a good way to keep latency down if the
system is very busy with distributed transactions.

In the current implementation the resolver process exists to resolve
in-doubt transactions. That process periodically checks whether there is
an unresolved transaction in shared memory and tries to resolve it
according to the commit log. If we change it so that the backend process
can communicate with the resolver process via SetLatch, the resolver
process would be better implemented in core rather than as a contrib
module.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#144

Stas Kelvich

s.kelvich@postgrespro.ru

over 8 years ago

In reply to: Robert Haas (#139)

Re: Transactions involving multiple postgres foreign servers

On 31 Jul 2017, at 20:03, Robert Haas <robertmhaas@gmail.com> wrote:

Regardless of whether we share XIDs or DXIDs, we need a more complex
concept of transaction state than we have now.

It seems the discussion has shifted from 2PC itself to the general issues with distributed
transactions, so it is probably appropriate to share here a summary of what we have
done in the area of distributed visibility. Over the last two years we tried three quite different
approaches and finally settled on Clock-SI.

First, to test different approaches, we wrote a small patch that wraps calls to visibility-related
functions (SetTransactionStatus, GetSnapshot, etc.; described in detail on the wiki[1]) so that
an extension can override them. This approach makes it possible to implement almost anything
related to distributed visibility, since you have full control over how local visibility is done.
That API isn’t a hard prerequisite: if one wants to create a concrete implementation,
it can be done in place. However, I think it is good to have such an API in some form.

So, the three approaches that we tried:

1) Postgres-XL-like:

This is the most straightforward way. Basically, we need a separate network service (GTM/DTM) that is
responsible for xid generation and for managing the running list of transactions, so acquiring
an xid and a snapshot is done by network calls. Because of the shared xid space, it is possible
to compare xids in the ordinary way and get the right order. The gap between non-simultaneous
commits by 2PC is covered by the fact that we get our snapshots from the GTM, and
it removes an xid from the running list only when the transaction has committed on both nodes.

Such an approach is okay for OLAP-style transactions where tps isn’t high. But for OLTP with a
high transaction rate the GTM will immediately become a bottleneck, since even write transactions
need to get a snapshot from the GTM, even if they access only one node.

2) Incremental SI [2]

An approach with a central coordinator that allows local reads without network
communication by slightly altering the visibility rules.

Apart from the fact that it is kind of patented, we also failed to achieve proper visibility
by implementing the algorithms from that paper: it always showed some inconsistencies,
maybe because of bugs in our implementation, maybe because of
typos/mistakes in the algorithm description itself. The reasoning in the paper wasn’t very
clear to us, nor were the patent issues, so we just dropped it.

3) Clock-SI [3]

It is an MS Research paper that describes an algorithm similar to the ones used in Spanner and
CockroachDB, without a central GTM and with reads that do not require a network roundtrip.

There are two ideas behind it:

* Assuming snapshot isolation and visibility on a node are based on CSNs, use local time as the CSN;
then, when you are doing 2PC, collect the prepare time from all participating nodes and
commit the transaction everywhere with the maximum of those times. If a node, during a read, faces tuples
committed by a tx with a CSN greater than its snapshot CSN (which can happen due to
time desynchronisation on the node), then it just waits until that time comes. So time desynchronisation
can affect performance, but can’t affect correctness.

* During a distributed commit, a transaction is neither running (if it commits, then its tuples
should already be visible) nor committed/aborted (it can still be aborted, so it is illegal to read).
So here an IN-DOUBT transaction state appears, in which readers should wait for writers.
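
A toy sketch of the two ideas, in Python for brevity. All names (`Participant`, `distributed_commit`, `visibility`) are made up for illustration and are not taken from our extension; clocks are plain integers, and the wait caused by clock skew is not modelled, only the in-doubt wait.

```python
class Participant:
    """Hypothetical node; `clock` stands in for its local physical time."""

    def __init__(self, name, clock):
        self.name = name
        self.clock = clock
        self.commit_csn = {}    # txid -> CSN assigned at commit
        self.in_doubt = set()   # prepared, outcome not yet known here

    def prepare(self, txid):
        self.in_doubt.add(txid)
        return self.clock       # prepare time reported to the coordinator

    def commit(self, txid, csn):
        self.commit_csn[txid] = csn
        self.in_doubt.discard(txid)

    def visibility(self, txid, snapshot_csn):
        if txid in self.in_doubt:
            return "wait"       # reader must wait for the writer's outcome
        csn = self.commit_csn.get(txid)
        if csn is not None and csn <= snapshot_csn:
            return "visible"
        return "invisible"


def distributed_commit(txid, participants):
    prepare_times = [p.prepare(txid) for p in participants]
    csn = max(prepare_times)    # same CSN everywhere: the latest prepare time
    for p in participants:
        p.commit(txid, csn)
    return csn


n1 = Participant("n1", clock=100)
n2 = Participant("n2", clock=105)   # n2's clock runs slightly ahead
csn = distributed_commit("t1", [n1, n2])

n1.prepare("t2")                    # t2 is prepared but not yet resolved
```

With these values `t1` gets CSN 105 on both nodes, so a snapshot taken at 104 does not see it anywhere, while a reader that meets `t2` is told to wait.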

We managed to implement that using the XTM API mentioned above. The XID<->CSN mapping is
handled by the extension itself. Speed and scalability are also good.

I want to resubmit an implementation of that algorithm for FDW later in August, along with some
isolation tests based on the set of queries in [4].

[1]: https://wiki.postgresql.org/wiki/DTM#eXtensible_Transaction_Manager_API
[2]: http://pi3.informatik.uni-mannheim.de/~norman/dsi_jour_2014.pdf
[3]: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/samehe-clocksi.srds2013.pdf
[4]: https://github.com/ept/hermitage

Stas Kelvich
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#145

Michael Paquier

michael.paquier@gmail.com

over 8 years ago

In reply to: Alvaro Herrera (#140)

Re: Transactions involving multiple postgres foreign servers

On Mon, Jul 31, 2017 at 7:27 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Robert Haas wrote:

An alternative approach is to have some kind of other identifier,
let's call it a distributed transaction ID (DXID) which is mapped by
each node onto a local XID.

Postgres-XL seems to manage this problem by using a transaction manager
node, which is in charge of assigning snapshots. I don't know how that
works, but perhaps adding that concept here could be useful too. One
critical point to that design is that the app connects not directly to
the underlying Postgres server but instead to some other node which is
or connects to the node that manages the snapshots.

Maybe Michael can explain in better detail how it works, and/or how (and
if) it could be applied here.

XL (and XC) use a transaction ID that plugs in directly with the
internal XID assigned by Postgres, actually bypassing what Postgres
assigns to each backend if a transaction needs one. So if transactions
are not heavily shared among multiple nodes, performance gets
impacted. When we worked on this project we noticed that we gained
performance by reducing the number of requests and grouping them
together, so a proxy layer was added between the global
transaction manager and Postgres to group those requests. This does
not change the fact that read-committed transactions still need a
snapshot for each query, which is costly. So this approach hurts
less with analytic queries, and more with OLTP.
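
To illustrate the grouping idea only, here is a toy sketch in Python. The `Gtm` and `GtmProxy` classes and their methods are hypothetical and do not mirror the real XC/XL code or protocol; the sketch just shows several backend requests being served by a single round trip.

```python
class Gtm:
    """Hypothetical global transaction manager handing out XIDs."""

    def __init__(self):
        self.next_xid = 1
        self.round_trips = 0

    def allocate(self, count):
        self.round_trips += 1               # one network round trip per call
        first = self.next_xid
        self.next_xid += count
        return list(range(first, first + count))


class GtmProxy:
    """Collects requests from local backends and forwards them in one batch."""

    def __init__(self, gtm):
        self.gtm = gtm
        self.pending = []

    def request_xid(self, backend_id):
        self.pending.append(backend_id)

    def flush(self):
        if not self.pending:
            return {}                       # nothing to ask the GTM for
        xids = self.gtm.allocate(len(self.pending))
        assigned = dict(zip(self.pending, xids))
        self.pending = []
        return assigned


gtm = Gtm()
proxy = GtmProxy(gtm)
for backend_id in ("b1", "b2", "b3"):
    proxy.request_xid(backend_id)
assigned = proxy.flush()
```

Three backends get their XIDs at the cost of one trip to the GTM instead of three.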

2PC transaction status was tracked in the GTM as well. This allows
fancy things like being able to prepare a transaction on node 1 and
commit it on node 2, for example. I am honestly not sure that you need
to add anything at clog level, but I think that storing the metadata
of a transaction at the FDW level is a rather sound approach.
That's what Greenplum actually does if
I recall correctly (Heikki save me!): it has one coordinator that handles
such metadata, and a bunch of underlying nodes that store the data.
Citus also does that if I recall correctly. So instead of
decentralizing this information, it gets stored in a Postgres
coordinator instance.
--
Michael


#146

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#138)

Re: Transactions involving multiple postgres foreign servers

Based on the review comments from Robert, I'm planning to make a big
change to the architecture of this patch so that a backend process
works together with a dedicated background worker that is responsible
for resolving the foreign transactions. Usage of this feature
will be almost the same as with the current patch, except
for a new GUC parameter that controls the number of resolver
processes to launch. That is, we can have multiple resolver processes to
keep latency down.

On the technical side, the processing of a transaction involving
multiple foreign servers will change as follows.

* Backend processes
1. In the PreCommit phase, prepare the transaction on the foreign servers and
save fdw_xact entries into the array in shmem. Also create an
fdw_xact_state entry in the shmem hash that holds the index of each fdw_xact
entry.
2. Local commit/abort.
3. Change its process state to FDWXACT_WAITING and enqueue MyProc
on the shmem queue.
4. Ping the resolver process via SetLatch.
5. Wait to be woken up.

* Resolver processes
1. Fetch a PGPROC entry from the shmem queue and get its XID (say, XID-a).
2. Get the fdw_xact_state entry from the shmem hash by XID-a.
3. Iterate over the fdw_xact entries using the index, and resolve the foreign
transactions.
3-a. If even one foreign transaction fails to resolve, raise an error.
4. Change the waiting backend's state to FDWXACT_COMPLETED and release it.

Also, the resolver process scans over the array of fdw_xact entry
periodically, and tries to resolve in-doubt transactions.
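
The hand-off between a backend and a resolver can be sketched roughly as follows, in Python for brevity. Everything here is a stand-in: `queue.Queue` plays the shmem queue, a `threading.Event` plays the backend's latch, a dict plays the shmem hash, and the `px_...` identifiers are invented. Error handling and the periodic scan for in-doubt transactions are left out.

```python
import queue
import threading

FDWXACT_WAITING, FDWXACT_COMPLETED = "waiting", "completed"


class Backend:
    def __init__(self, xid):
        self.xid = xid
        self.state = None
        self.latch = threading.Event()   # stands in for the backend's latch


fdw_xact_state = {}          # xid -> list of (server, prepared id)
resolved = []                # what the resolver "committed" remotely
wait_queue = queue.Queue()   # stands in for the shmem queue of waiters


def resolver_main():
    while True:
        backend = wait_queue.get()       # blocks until a backend enqueues itself
        if backend is None:
            break                        # shutdown request for this sketch
        for server, gid in fdw_xact_state.pop(backend.xid):
            resolved.append((server, gid))   # real code would resolve remotely
        backend.state = FDWXACT_COMPLETED
        backend.latch.set()              # release the waiting backend


def backend_commit(backend, servers):
    # 1. PreCommit: prepare on each foreign server and remember the entries.
    fdw_xact_state[backend.xid] = [(s, f"px_{backend.xid}_{s}") for s in servers]
    # 2. The local commit would happen here.
    # 3-4. Mark as waiting, enqueue, and thereby wake the resolver.
    backend.state = FDWXACT_WAITING
    wait_queue.put(backend)
    # 5. Wait to be woken up.
    backend.latch.wait(timeout=5)
    return backend.state


resolver = threading.Thread(target=resolver_main, daemon=True)
resolver.start()
b = Backend(xid=1234)
final_state = backend_commit(b, ["server_a", "server_b"])
wait_queue.put(None)
resolver.join(timeout=5)
```

Starting several `resolver_main` threads on the same queue would correspond to running multiple resolver processes.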
This patch still has open design concerns and I'm planning to
update it for the next commit fest, so I'll mark this as
"Waiting on Author".

Feedback and suggestions are very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#147

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#146)

Re: Transactions involving multiple postgres foreign servers

On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Based on the review comment from Robert, I'm planning to do the big
change to the architecture of this patch so that a backend process
work together with a dedicated background worker that is responsible
for resolving the foreign transactions. For the usage of this feature,
it will be almost the same as what this patch has been doing except
for adding a new GUC parameter that controls the number of resolver
process launch. That is, we can have multiple resolver process to keep
latency down.

Multiple resolver processes are useful but get a bit complicated. For
example, if process 1 has a connection open to foreign server A and
process 2 does not, and a request arrives that needs to be handled on
foreign server A, what happens? If process 1 is already busy doing
something else, probably we want process 2 to try to open a new
connection to foreign server A and handle the request. But if process
1 and 2 are both idle, ideally we'd like 1 to get that request rather
than 2. That seems a bit difficult to get working though. Maybe we
should just ignore such considerations in the first version.
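
For what it's worth, the preference described above could be expressed as something like the following sketch (Python for brevity; the dict layout and `pick_resolver` are hypothetical, and how such state would be shared between processes is exactly the hard part that is not shown).

```python
def pick_resolver(resolvers, server):
    """Return the resolver that should take a request for `server`, or None."""
    idle = [r for r in resolvers if not r["busy"]]
    if not idle:
        return None                       # everyone is busy; caller must queue
    for r in idle:
        if server in r["connections"]:
            return r                      # idle and already connected: best case
    return idle[0]                        # idle but must open a new connection


p1 = {"name": "resolver 1", "busy": False, "connections": {"A"}}
p2 = {"name": "resolver 2", "busy": False, "connections": set()}

both_idle = pick_resolver([p2, p1], "A")["name"]
p1["busy"] = True
p1_busy = pick_resolver([p2, p1], "A")["name"]
```

When both are idle the request goes to process 1, which already has the connection; when process 1 is busy it falls to process 2.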

* Resolver processes
1. Fetch PGPROC entry from the shmem queue and get its XID (say, XID-a).
2. Get the fdw_xact_state entry from shmem hash by XID-a.
3. Iterate fdw_xact entries using the index, and resolve the foreign
transactions.
3-a. If even one foreign transaction failed to resolve, raise an error.
4. Change the waiting backend state to FDWXACT_COMPLETED and release it.

Comments:

- Note that any error we raise here won't reach the user; this is a
background process. We don't want to get into a loop where we just
error out repeatedly forever -- at least not if there's any other
reasonable choice.

- I suggest that we ought to track the status for each XID separately
on each server rather than just track the XID status overall. That
way, if transaction resolution fails on one server, we don't keep
trying to reconnect to the others.

- If we go to resolve a remote transaction and find that no such
remote transaction exists, what should we do? I'm inclined to think
that we should regard that as if we had succeeded in resolving the
transaction. Certainly, if we've retried the server repeatedly, it
might be that we previously succeeded in resolving the transaction but
then the network connection was broken before we got the success
message back from the remote server. But even if that's not the
scenario, I think we should assume that the DBA or some other system
resolved it and therefore we don't need to do anything further. If we
assume anything else, then we just go into an infinite error loop,
which isn't useful behavior. We could log a message, though (for
example, LOG: unable to resolve foreign transaction ... because it
does not exist).
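
As a rough illustration of the last two comments, here is a standalone C
sketch of one resolver pass that tracks status per (XID, server) pair and
treats a missing remote transaction as resolved. FxEntry, resolve_fn and
resolve_pass() are hypothetical names, not the patch's structures. The
sketch also assumes the remote side reports SQLSTATE 42704
(undefined_object) when the prepared transaction no longer exists, which
as far as I know is what PostgreSQL does for COMMIT PREPARED with an
unknown identifier. fake_resolve() only stands in for a real remote call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-(XID, server) entry; the patch's real struct differs. */
typedef enum
{
	FX_PREPARED,
	FX_RESOLVED
} FxStatus;

typedef struct FxEntry
{
	unsigned	xid;
	unsigned	server_oid;
	FxStatus	status;
} FxEntry;

/* Returns NULL on success, otherwise the SQLSTATE of the failure. */
typedef const char *(*resolve_fn) (const FxEntry *entry);

/* Assumption: reported when the prepared transaction is already gone. */
#define SQLSTATE_UNDEFINED_OBJECT "42704"

/* One pass over the entries of xid; returns how many are still pending. */
static int
resolve_pass(FxEntry *entries, int n, unsigned xid, resolve_fn try_resolve)
{
	int			pending = 0;

	assert(entries != NULL && try_resolve != NULL);

	for (int i = 0; i < n; i++)
	{
		FxEntry    *e = &entries[i];
		const char *sqlstate;

		if (e->xid != xid || e->status == FX_RESOLVED)
			continue;			/* other XID, or this server is already done */

		sqlstate = try_resolve(e);
		if (sqlstate == NULL)
			e->status = FX_RESOLVED;
		else if (strcmp(sqlstate, SQLSTATE_UNDEFINED_OBJECT) == 0)
		{
			/* Someone else resolved it: log once and stop retrying. */
			fprintf(stderr, "LOG:  unable to resolve foreign transaction "
					"for xid %u on server %u because it does not exist\n",
					e->xid, e->server_oid);
			e->status = FX_RESOLVED;
		}
		else
			pending++;			/* only this server is retried later */
	}
	return pending;
}

/* Stand-in for a real remote call, for demonstration only. */
static int	n_attempts = 0;

static const char *
fake_resolve(const FxEntry *entry)
{
	n_attempts++;
	if (entry->server_oid == 2)
		return "08006";			/* connection_failure */
	if (entry->server_oid == 3)
		return SQLSTATE_UNDEFINED_OBJECT;
	return NULL;
}
```

Running resolve_pass() twice over three entries for the same XID contacts
all three servers the first time but only the unreachable one the second
time, so a failure on one server does not cause reconnects to the others.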

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#148

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#147)

Re: Transactions involving multiple postgres foreign servers

On Tue, Sep 26, 2017 at 9:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Based on the review comment from Robert, I'm planning to make a big
change to the architecture of this patch so that a backend process
works together with a dedicated background worker that is responsible
for resolving the foreign transactions. Usage of this feature
will be almost the same as with the current patch, except
for a new GUC parameter that controls the number of resolver
processes to launch. That is, we can have multiple resolver processes
to keep latency down.

Having multiple resolver processes is useful but gets a bit complicated. For
example, if process 1 has a connection open to foreign server A and
process 2 does not, and a request arrives that needs to be handled on
foreign server A, what happens? If process 1 is already busy doing
something else, probably we want process 2 to try to open a new
connection to foreign server A and handle the request. But if process
1 and 2 are both idle, ideally we'd like 1 to get that request rather
than 2. That seems a bit difficult to get working though. Maybe we
should just ignore such considerations in the first version.

Understood. I'll keep it simple in the first version.

* Resolver processes
1. Fetch PGPROC entry from the shmem queue and get its XID (say, XID-a).
2. Get the fdw_xact_state entry from shmem hash by XID-a.
3. Iterate fdw_xact entries using the index, and resolve the foreign
transactions.
3-a. If even one foreign transaction failed to resolve, raise an error.
4. Change the waiting backend state to FDWXACT_COMPLETED and release it.

Comments:

- Note that any error we raise here won't reach the user; this is a
background process. We don't want to get into a loop where we just
error out repeatedly forever -- at least not if there's any other
reasonable choice.

Thank you for the comments.

Agreed.

- I suggest that we ought to track the status for each XID separately
on each server rather than just track the XID status overall. That
way, if transaction resolution fails on one server, we don't keep
trying to reconnect to the others.

Agreed. In the current patch we manage fdw_xact entries that track the
status for each XID separately on each server. I'm going to use the
same mechanism. The resolver process gets a target XID from the shmem
queue and fetches all fdw_xact entries associated with that XID from
the fdw_xact array in shmem. But scanning the whole fdw_xact array
could be slow, because the number of entries could be large
(e.g., max_connections * # of foreign servers), so
I'm considering keeping a linked list of all fdw_xact entries
associated with the same XID, plus a shmem hash pointing to the
first fdw_xact entry of the linked list for each XID. That way, we
can find the target fdw_xact entries in the array in O(1).
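
A minimal sketch of the idea, in standalone C rather than real shmem
code: entries live in a flat array, entries of the same XID are chained
through a next index, and a small hash maps each XID to the head of its
chain. All names here (FdwXactSketch, XidHead, add_fdw_xact(),
first_fdw_xact()) are hypothetical, and the open-addressing table merely
stands in for a shmem hash. Removal of entries is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FDW_XACTS	64
#define HASH_SLOTS		128		/* power of two, larger than MAX_FDW_XACTS */
#define NO_ENTRY		(-1)

/* Hypothetical array entry: one prepared transaction on one server. */
typedef struct FdwXactSketch
{
	unsigned	xid;
	unsigned	server_oid;
	int			next;			/* next entry with the same xid, or NO_ENTRY */
} FdwXactSketch;

/* Hypothetical hash slot: XID -> index of the first entry in its chain. */
typedef struct XidHead
{
	unsigned	xid;
	int			head;
	int			used;
} XidHead;

static FdwXactSketch fdw_xacts[MAX_FDW_XACTS];
static int	n_fdw_xacts = 0;
static XidHead xid_heads[HASH_SLOTS];

/* Find the slot for xid, or the empty slot where it would be inserted. */
static XidHead *
lookup_head(unsigned xid)
{
	unsigned	slot = (xid * 2654435761u) & (HASH_SLOTS - 1);

	/* Linear probing; terminates because the table can never fill up. */
	while (xid_heads[slot].used && xid_heads[slot].xid != xid)
		slot = (slot + 1) & (HASH_SLOTS - 1);
	return &xid_heads[slot];
}

/* Append an entry and push it onto the front of its XID's chain. */
static int
add_fdw_xact(unsigned xid, unsigned server_oid)
{
	XidHead    *h;
	int			idx;

	assert(n_fdw_xacts < MAX_FDW_XACTS);
	idx = n_fdw_xacts++;
	h = lookup_head(xid);

	fdw_xacts[idx].xid = xid;
	fdw_xacts[idx].server_oid = server_oid;
	fdw_xacts[idx].next = h->used ? h->head : NO_ENTRY;

	h->xid = xid;
	h->head = idx;
	h->used = 1;
	return idx;
}

/* Index of the first entry for xid, or NO_ENTRY if there is none. */
static int
first_fdw_xact(unsigned xid)
{
	XidHead    *h = lookup_head(xid);

	return h->used ? h->head : NO_ENTRY;
}
```

Finding the first entry is an expected O(1) hash lookup; walking the rest
of the chain then costs one step per foreign server involved in that
transaction, instead of a scan over the whole array.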

- If we go to resolve a remote transaction and find that no such
remote transaction exists, what should we do? I'm inclined to think
that we should regard that as if we had succeeded in resolving the
transaction. Certainly, if we've retried the server repeatedly, it
might be that we previously succeeded in resolving the transaction but
then the network connection was broken before we got the success
message back from the remote server. But even if that's not the
scenario, I think we should assume that the DBA or some other system
resolved it and therefore we don't need to do anything further. If we
assume anything else, then we just go into an infinite error loop,
which isn't useful behavior. We could log a message, though (for
example, LOG: unable to resolve foreign transaction ... because it
does not exist).

Agreed.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#149

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 8 years ago

In reply to: Masahiko Sawada (#148)

Re: Transactions involving multiple postgres foreign servers

On Wed, Sep 27, 2017 at 12:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Sep 26, 2017 at 9:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Based on the review comment from Robert, I'm planning to make a big
change to the architecture of this patch so that a backend process
works together with a dedicated background worker that is responsible
for resolving the foreign transactions. Usage of this feature
will be almost the same as with the current patch, except
for a new GUC parameter that controls the number of resolver
processes to launch. That is, we can have multiple resolver processes
to keep latency down.

Having multiple resolver processes is useful but gets a bit complicated. For
example, if process 1 has a connection open to foreign server A and
process 2 does not, and a request arrives that needs to be handled on
foreign server A, what happens? If process 1 is already busy doing
something else, probably we want process 2 to try to open a new
connection to foreign server A and handle the request. But if process
1 and 2 are both idle, ideally we'd like 1 to get that request rather
than 2. That seems a bit difficult to get working though. Maybe we
should just ignore such considerations in the first version.

Understood. I'll keep it simple in the first version.

While a resolver process is useful for resolving transactions later, it
seems better for performance to try to resolve the prepared foreign
transactions in the post-commit phase, in the same backend that prepared
them, for two reasons: 1. the backend already has a connection to that
foreign server; 2. it has just run some commands to completion on that
foreign server, so it's highly likely that a COMMIT PREPARED would
succeed too. If we let a resolver process do that, we will spend time
1. signalling the resolver process and 2. setting up a connection to the
foreign server, and 3. by the time the resolver process tries to resolve
the prepared transaction the foreign server may have become unavailable,
thus delaying the resolution.

That said, I agree that the post-commit phase doesn't have a transaction
of its own, and thus catalog lookups and error reporting are not
possible. We will need some different approach here, which may not be
straightforward. So, we may need to delay this optimization for v2. I
think we have discussed this before, but I can't find the mail off-hand.

* Resolver processes
1. Fetch PGPROC entry from the shmem queue and get its XID (say, XID-a).
2. Get the fdw_xact_state entry from shmem hash by XID-a.
3. Iterate fdw_xact entries using the index, and resolve the foreign
transactions.
3-a. If even one foreign transaction failed to resolve, raise an error.
4. Change the waiting backend state to FDWXACT_COMPLETED and release it.

Comments:

- Note that any error we raise here won't reach the user; this is a
background process. We don't want to get into a loop where we just
error out repeatedly forever -- at least not if there's any other
reasonable choice.

Thank you for the comments.

Agreed.

We should probably log an error message in the server log, so that
DBAs are aware of such a failure. Is that something you are
planning to do?

- I suggest that we ought to track the status for each XID separately
on each server rather than just track the XID status overall. That
way, if transaction resolution fails on one server, we don't keep
trying to reconnect to the others.

Agreed. In the current patch we manage fdw_xact entries that track the
status for each XID separately on each server. I'm going to use the
same mechanism. The resolver process gets a target XID from the shmem
queue and fetches all fdw_xact entries associated with that XID from
the fdw_xact array in shmem. But scanning the whole fdw_xact array
could be slow, because the number of entries could be large
(e.g., max_connections * # of foreign servers), so
I'm considering keeping a linked list of all fdw_xact entries
associated with the same XID, plus a shmem hash pointing to the
first fdw_xact entry of the linked list for each XID. That way, we
can find the target fdw_xact entries in the array in O(1).

If we want to do something like this, would it be useful to use a data
structure similar to what is used for maintaining subtransactions? Just
a thought.

- If we go to resolve a remote transaction and find that no such
remote transaction exists, what should we do? I'm inclined to think
that we should regard that as if we had succeeded in resolving the
transaction. Certainly, if we've retried the server repeatedly, it
might be that we previously succeeded in resolving the transaction but
then the network connection was broken before we got the success
message back from the remote server. But even if that's not the
scenario, I think we should assume that the DBA or some other system
resolved it and therefore we don't need to do anything further. If we
assume anything else, then we just go into an infinite error loop,
which isn't useful behavior. We could log a message, though (for
example, LOG: unable to resolve foreign transaction ... because it
does not exist).

Agreed.

Yes. I think the current patch takes care of this, except probably the
error message.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#150

Stas Kelvich

s.kelvich@postgrespro.ru

over 8 years ago

In reply to: Masahiko Sawada (#146)

1 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On 26 Sep 2017, at 12:06, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Based on the review comment from Robert, I'm planning to make a big
change to the architecture of this patch so that a backend process
works together with a dedicated background worker that is responsible
for resolving the foreign transactions.

For what it's worth, I rebased the latest patch onto current master.

As far as I understand, the plan is to change the resolver architecture,
so is it okay to review the code that is intended for non-faulty
work scenarios?

Attachments:

fdw2pc_v13.diff (application/octet-stream)

diff --git a/contrib/fdw_transaction_resovler/Makefile b/contrib/fdw_transaction_resovler/Makefile
new file mode 100644
index 0000000..0d2e0e9
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/Makefile
@@ -0,0 +1,15 @@
+# contrib/fdw_transaction_resolver/Makefile
+
+MODULES = fdw_transaction_resolver
+PGFILEDESC = "fdw_transaction_resolver - foreign transaction resolver daemon"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/fdw_transaction_resolver
+top_builddir = ../../
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/fdw_transaction_resovler/TAGS b/contrib/fdw_transaction_resovler/TAGS
new file mode 120000
index 0000000..cf64c85
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/TAGS
@@ -0,0 +1 @@
+/home/masahiko/pgsql/source/postgresql/TAGS
\ No newline at end of file
diff --git a/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
new file mode 100644
index 0000000..055d458
--- /dev/null
+++ b/contrib/fdw_transaction_resovler/fdw_transaction_resolver.c
@@ -0,0 +1,457 @@
+/* -------------------------------------------------------------------------
+ *
+ * fdw_transaction_resolver.c
+ *
+ * Contrib module to launch foreign transaction resolver to resolve unresolved
+ * transactions prepared on foreign servers.
+ *
+ * The extension launches foreign transaction resolver launcher process as a
+ * background worker. The launcher then launches separate background worker
+ * process to resolve the foreign transaction in each database. The worker
+ * process simply connects to the database specified and calls pg_fdw_xact_resolve()
+ * function, which tries to resolve the transactions. The launcher process
+ * launches at most one worker at a time.
+ *
+ * Copyright (C) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		contrib/fdw_transaction_resolver/fdw_transaction_resolver.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+/* These are always necessary for a bgworker */
+#include "miscadmin.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lwlock.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+
+/* these headers are used by this particular worker's code */
+#include "access/heapam.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/fdw_xact.h"
+#include "catalog/pg_database.h"
+#include "executor/spi.h"
+#include "fmgr.h"
+#include "lib/stringinfo.h"
+#include "pgstat.h"
+#include "utils/builtins.h"
+#include "utils/snapmgr.h"
+#include "utils/timestamp.h"
+#include "tcop/utility.h"
+
+PG_MODULE_MAGIC;
+
+void		_PG_init(void);
+
+/*
+ * Flags set by interrupt handlers of foreign transaction resolver for later
+ * service in the main loop.
+ */
+static volatile sig_atomic_t got_sighup = false;
+static volatile sig_atomic_t got_sigterm = false;
+static volatile sig_atomic_t got_sigquit = false;
+static volatile sig_atomic_t got_sigusr1 = false;
+
+static void FDWXactResolver_worker_main(Datum dbid_datum);
+static void FDWXactResolverMain(Datum main_arg);
+static List *get_database_list(void);
+
+/* GUC variable */
+static int	fx_resolver_naptime;
+
+/*
+ * Signal handler for SIGTERM
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGTERM(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigterm = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGQUIT
+ *		Set a flag to tell the main loop to terminate, and set our latch to wake
+ *		it up.
+ */
+static void
+FDWXactResolver_SIGQUIT(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigquit = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Signal handler for SIGHUP
+ *		Set a flag to tell the main loop to reread the config file, and set
+ *		our latch to wake it up.
+ */
+static void
+FDWXactResolver_SIGHUP(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sighup = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+static void
+FDWXactResolver_SIGUSR1(SIGNAL_ARGS)
+{
+	int			save_errno = errno;
+
+	got_sigusr1 = true;
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/*
+ * Entrypoint of this module.
+ *
+ * Launches the foreign transaction resolver daemon.
+ */
+void
+_PG_init(void)
+{
+	BackgroundWorker worker;
+
+	if (!process_shared_preload_libraries_in_progress)
+		return;
+
+	DefineCustomIntVariable("fdw_transaction_resolver.naptime",
+					  "Time to sleep between fdw_transaction_resolver runs.",
+							NULL,
+							&fx_resolver_naptime,
+							60,
+							1,
+							INT_MAX,
+							PGC_SIGHUP,
+							0,
+							NULL, NULL, NULL);
+
+	/* set up common data for all our workers */
+
+	/*
+	 * For some reason, unless a background worker sets
+	 * BGWORKER_BACKEND_DATABASE_CONNECTION, it's not added to BackendList and
+	 * hence notification to this backend is not enabled. So set that flag
+	 * even if the backend itself doesn't need a database connection.
+	 */
+	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
+	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	worker.bgw_restart_time = 5;
+	snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver launcher");
+	sprintf(worker.bgw_library_name, "fdw_transaction_resolver");
+	sprintf(worker.bgw_function_name, "FDWXactResolverMain");
+	worker.bgw_main_arg = (Datum) 0;	/* Craft some dummy arg. */
+	worker.bgw_notify_pid = 0;
+
+	RegisterBackgroundWorker(&worker);
+}
+
+void
+FDWXactResolverMain(Datum main_arg)
+{
+	/* For launching background worker */
+	BackgroundWorker worker;
+	BackgroundWorkerHandle *handle = NULL;
+	pid_t		pid;
+	List	   *dbid_list = NIL;
+	TimestampTz launched_time = GetCurrentTimestamp();
+	TimestampTz next_launch_time = TimestampTzPlusMilliseconds(launched_time, fx_resolver_naptime * 1000L);
+
+	ereport(LOG,
+			(errmsg("fdw_transaction_resolver launcher started")));
+
+	/* Properly accept or ignore signals the postmaster might send us */
+	pqsignal(SIGHUP, FDWXactResolver_SIGHUP);	/* set flag to read config
+												 * file */
+	pqsignal(SIGTERM, FDWXactResolver_SIGTERM); /* request shutdown */
+	pqsignal(SIGQUIT, FDWXactResolver_SIGQUIT); /* hard crash time */
+	pqsignal(SIGUSR1, FDWXactResolver_SIGUSR1);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize connection */
+	BackgroundWorkerInitializeConnection(NULL, NULL);
+
+	/*
+	 * Main loop: do this until the SIGTERM handler tells us to terminate
+	 */
+	while (!got_sigterm)
+	{
+		int			rc;
+		int			naptime_msec;
+		TimestampTz current_time = GetCurrentTimestamp();
+
+		/* Determine sleep time in msec; TimestampTz values are in microseconds */
+		naptime_msec = (fx_resolver_naptime * 1000L) - (current_time - launched_time) / 1000;
+
+		if (naptime_msec < 0)
+			naptime_msec = 0;
+
+		/*
+		 * Background workers mustn't call usleep() or any direct equivalent:
+		 * instead, they may wait on their process latch, which sleeps as
+		 * necessary, but is awakened if postmaster dies.  That way the
+		 * background process goes away immediately in an emergency.
+		 */
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   naptime_msec,
+					   WAIT_EVENT_PG_SLEEP);
+		ResetLatch(MyLatch);
+
+		/* emergency bailout if postmaster has died */
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+
+		/*
+		 * Postmaster wants to stop this process. Exit with non-zero code, so
+		 * that the postmaster starts this process again. The worker processes
+		 * will receive the signal and end themselves. This process will
+		 * restart them if necessary.
+		 */
+		if (got_sigquit)
+			proc_exit(2);
+
+		/* In case of a SIGHUP, just reload the configuration */
+		if (got_sighup)
+		{
+			got_sighup = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		if (got_sigusr1)
+		{
+			got_sigusr1 = false;
+
+			/* If we had started a worker check whether it completed */
+			if (handle)
+			{
+				BgwHandleStatus status;
+
+				status = GetBackgroundWorkerPid(handle, &pid);
+				if (status == BGWH_STOPPED)
+					handle = NULL;
+			}
+		}
+
+		current_time = GetCurrentTimestamp();
+
+		/*
+		 * If no background worker is running, we can start one if there are
+		 * unresolved foreign transactions.
+		 */
+		if (!handle &&
+			TimestampDifferenceExceeds(next_launch_time, current_time, naptime_msec))
+		{
+			Oid			dbid;
+
+			/* Get the database list if empty */
+			if (!dbid_list)
+				dbid_list = get_database_list();
+
+			/* Launch a worker if dbid_list has database */
+			if (dbid_list)
+			{
+				/* Work on the first dbid, and remove it from the list */
+				dbid = linitial_oid(dbid_list);
+				dbid_list = list_delete_oid(dbid_list, dbid);
+
+				Assert(OidIsValid(dbid));
+
+				/* Start the foreign transaction resolver */
+				worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
+					BGWORKER_BACKEND_DATABASE_CONNECTION;
+				worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
+
+				/* We will start another worker if needed */
+				worker.bgw_restart_time = BGW_NEVER_RESTART;
+				sprintf(worker.bgw_library_name, "fdw_transaction_resolver");
+				sprintf(worker.bgw_function_name, "FDWXactResolver_worker_main");
+				snprintf(worker.bgw_name, BGW_MAXLEN, "foreign transaction resolver (dbid %u)", dbid);
+				worker.bgw_main_arg = ObjectIdGetDatum(dbid);
+
+				/* set bgw_notify_pid so that we can wait for it to finish */
+				worker.bgw_notify_pid = MyProcPid;
+
+				RegisterDynamicBackgroundWorker(&worker, &handle);
+			}
+
+			/* Set next launch time */
+			launched_time = current_time;
+			next_launch_time = TimestampTzPlusMilliseconds(launched_time,
+												fx_resolver_naptime * 1000L);
+		}
+	}
+
+	/* Time to exit */
+	ereport(LOG,
+			(errmsg("foreign transaction resolver shutting down")));
+
+	proc_exit(0);				/* done */
+}
+
+/* FDWXactWorker_SIGTERM
+ * Terminates the foreign transaction resolver worker process */
+static void
+FDWXactWorker_SIGTERM(SIGNAL_ARGS)
+{
+	/* Just terminate the current process */
+	proc_exit(1);
+}
+
+/* Per database foreign transaction resolver */
+static void
+FDWXactResolver_worker_main(Datum dbid_datum)
+{
+	char	   *command = "SELECT * FROM pg_fdw_xact_resolve() WHERE status = 'resolved'";
+	Oid			dbid = DatumGetObjectId(dbid_datum);
+	int			ret;
+
+	/*
+	 * This background worker does not loop infinitely, so we need handler
+	 * only for SIGTERM, in which case the process should just exit quickly.
+	 */
+	pqsignal(SIGTERM, FDWXactWorker_SIGTERM);
+	pqsignal(SIGQUIT, FDWXactWorker_SIGTERM);
+
+	/* Unblock signals */
+	BackgroundWorkerUnblockSignals();
+
+	/*
+	 * Run this background worker in superuser mode, so that all the foreign
+	 * server and user information is accessible.
+	 */
+	BackgroundWorkerInitializeConnectionByOid(dbid, InvalidOid);
+
+	/*
+	 * Start a transaction on which we can call resolver function. Note that
+	 * each StartTransactionCommand() call should be preceded by a
+	 * SetCurrentStatementStartTimestamp() call, which sets both the time for
+	 * the statement we're about to run, and also the transaction start time.
+	 * Also, each other query sent to SPI should probably be preceded by
+	 * SetCurrentStatementStartTimestamp(), so that statement start time is
+	 * always up to date.
+	 *
+	 * The SPI_connect() call lets us run queries through the SPI manager, and
+	 * the PushActiveSnapshot() call creates an "active" snapshot which is
+	 * necessary for queries to have MVCC data to work on.
+	 *
+	 * The pgstat_report_activity() call makes our activity visible through
+	 * the pgstat views.
+	 */
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	SPI_connect();
+	PushActiveSnapshot(GetTransactionSnapshot());
+	pgstat_report_activity(STATE_RUNNING, command);
+
+	/* Run the resolver function */
+	ret = SPI_execute(command, false, 0);
+
+	if (ret < 0)
+		elog(LOG, "error running pg_fdw_xact_resolve() within database %u",
+			 dbid);
+
+	if (SPI_processed > 0)
+		ereport(LOG,
+				(errmsg("resolved %lu foreign transactions", (unsigned long) SPI_processed)));
+
+	/*
+	 * And finish our transaction.
+	 */
+	SPI_finish();
+	PopActiveSnapshot();
+	CommitTransactionCommand();
+	pgstat_report_activity(STATE_IDLE, NULL);
+
+	/* Done, exit now */
+	proc_exit(0);
+}
+
+/* Get database list */
+static List *
+get_database_list(void)
+{
+	List	   *dblist = NIL;
+	ListCell   *cell;
+	ListCell   *next;
+	ListCell   *prev = NULL;
+	HeapScanDesc scan;
+	HeapTuple	tup;
+	Relation	rel;
+	MemoryContext resultcxt;
+
+	/* This is the context that we will allocate our output data in */
+	resultcxt = CurrentMemoryContext;
+
+	SetCurrentStatementStartTimestamp();
+	StartTransactionCommand();
+	(void) GetTransactionSnapshot();
+
+	rel = heap_open(DatabaseRelationId, AccessShareLock);
+	scan = heap_beginscan_catalog(rel, 0, NULL);
+
+	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
+	{
+		MemoryContext oldcxt;
+
+		/*
+		 * Allocate our results in the caller's context, not the
+		 * transaction's. We do this inside the loop, and restore the original
+		 * context at the end, so that leaky things like heap_getnext() are
+		 * not called in a potentially long-lived context.
+		 */
+		oldcxt = MemoryContextSwitchTo(resultcxt);
+		dblist = lappend_oid(dblist, HeapTupleGetOid(tup));
+		MemoryContextSwitchTo(oldcxt);
+	}
+
+	heap_endscan(scan);
+	heap_close(rel, AccessShareLock);
+
+	CommitTransactionCommand();
+
+	/*
+	 * Keep only the databases that have a foreign transaction entry: delete a
+	 * database from the list if it has none.
+	 */
+	for (cell = list_head(dblist); cell != NULL; cell = next)
+	{
+		Oid			dbid = lfirst_oid(cell);
+		bool		exists;
+
+		next = lnext(cell);
+
+		exists = fdw_xact_exists(InvalidTransactionId, dbid, InvalidOid, InvalidOid);
+
+		if (!exists)
+			dblist = list_delete_cell(dblist, cell, prev);
+		else
+			prev = cell;
+	}
+
+	return dblist;
+}
diff --git a/contrib/postgres_fdw/Makefile b/contrib/postgres_fdw/Makefile
index 3543312..8054330 100644
--- a/contrib/postgres_fdw/Makefile
+++ b/contrib/postgres_fdw/Makefile
@@ -11,6 +11,7 @@ EXTENSION = postgres_fdw
 DATA = postgres_fdw--1.0.sql
 
 REGRESS = postgres_fdw
+REGRESS_OPTS= --temp-config $(top_srcdir)/contrib/postgres_fdw/pgfdw.conf
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index be4ec07..0526094 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -16,7 +16,9 @@
 
 #include "access/htup_details.h"
 #include "catalog/pg_user_mapping.h"
+#include "access/fdw_xact.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -73,12 +75,14 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void disconnect_pg_server(ConnCacheEntry *entry);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_xact_callback(XactEvent event, void *arg);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
@@ -91,6 +95,8 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
 						 bool ignore_errors);
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 						 PGresult **result);
+static bool server_uses_two_phase_commit(ForeignServer *server);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry);
 
 
 /*
@@ -102,9 +108,20 @@ static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
  * will_prep_stmt must be true if caller intends to create any prepared
  * statements.  Since those don't go away automatically at transaction end
  * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * connection_error_ok if true, indicates that caller can handle connection
+ * error by itself. If false, raise error.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
  */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -136,9 +153,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 									  pgfdw_inval_callback, (Datum) 0);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
 	key = user->umid;
 
@@ -197,8 +211,20 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 			GetSysCacheHashValue1(USERMAPPINGOID,
 								  ObjectIdGetDatum(user->umid));
 
-		/* Now try to make the connection */
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		/*
+		 * If the attempt to connect to the foreign server failed, we should not
+		 * come here, unless the caller has indicated so.
+		 */
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -207,7 +233,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -217,9 +248,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -265,11 +299,29 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+			{
+				return NULL;
+			}
+			else
+			{
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						errmsg("could not connect to server \"%s\"",
+								server->servername),
+						errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+			}
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -414,15 +466,22 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer *server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and record whether it is
+		 * configured to use two-phase commit.
+		 */
+		RegisterXactForeignServer(serverid, userid, server_uses_two_phase_commit(server));
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -644,185 +703,284 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
+ * postgresGetPrepareId
+ *
+ * The function crafts a prepared transaction identifier. The PostgreSQL
+ * documentation mentions two restrictions on the name:
+ * 1. It must be a string literal, less than 200 bytes long.
+ * 2. It must differ from any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * a UUID, which gives unique ids with high probability, but that may be
+ * expensive here, and the UUID extension which provides the function to
+ * generate UUIDs is not part of the core.
+ */
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
+{
+/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4ld_%d_%d",
+								"px", Abs(random() * RANDOM_LARGE_MULTIPLIER),
+								serverid, userid);
+	/* Report the length, excluding the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
+
+/*
+ * postgresPrepareForeignTransaction
+ *
+ * The function prepares the transaction on the foreign server.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  int prep_info_len, char *prep_info)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		bool result;
+		PGconn	*conn = entry->conn;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%.*s'", prep_info_len,
+																	prep_info);
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+
+		if (!result)
+		{
+			/*
+			 * TODO: check whether we should raise an error or warning.
+			 * The command failed, raise a warning, so that the reason for
+			 * failure gets logged. Do not raise an error, the caller i.e. foreign
+			 * transaction manager takes care of taking appropriate action.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	else
+		return false;
+}
+
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit)
+{
+	StringInfo		command;
+	PGresult		*res;
+	ConnCacheEntry	*entry = NULL;
+	ConnCacheKey	 key;
+	bool			found;
+
+	/* Create hash key for the entry.  Assume no pad bytes in key struct */
+	key = umid;
+
+	Assert(ConnectionHash);
+	entry = hash_search(ConnectionHash, &key, HASH_FIND, &found);
+
+	if (found && entry->conn)
+	{
+		PGconn	*conn = entry->conn;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",
+							is_commit ? "COMMIT" : "ROLLBACK");
+		res = PQexec(conn, command->data);
+		result = (PQresultStatus(res) == PGRES_COMMAND_OK);
+		if (!result)
+		{
+			/*
+			 * The local transaction has ended, so there is no point in raising
+			 * error. Raise a warning so that the reason for the failure gets
+			 * logged.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+		}
+
+		PQclear(res);
+		pgfdw_cleanup_after_transaction(entry);
+		return result;
+	}
+	return false;
+}
+
+/*
+ * postgresResolvePreparedForeignTransaction
+ *
+ * The function commits or aborts a prepared transaction on the foreign server.
+ * It may be called when we don't have any connection to the foreign server
+ * involved in the distributed transaction being resolved.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit,
+										  int prep_info_len, char *prep_info)
+{
+	PGconn			*conn = NULL;
+
+	/*
+	 * If there exists a connection in the connection cache that can be used,
+	 * use it. If there is none, we need foreign server and user information
+	 * which can be obtained only when in a transaction block.
+	 * If we are resolving prepared foreign transactions immediately after
+	 * preparing them, the connection hash would have a connection. If we are
+	 * resolving them any other time, a resolver would have started a
+	 * transaction.
+	 */
+	if (ConnectionHash)
+	{
+		/* Connection hash should have a connection we want */
+		bool		found;
+		ConnCacheKey key;
+		ConnCacheEntry	*entry;
+
+		/* Create hash key for the entry.  Assume no pad bytes in key struct */
+		key = umid;
+
+		entry = (ConnCacheEntry *)hash_search(ConnectionHash, &key, HASH_FIND, &found);
+		if (found && entry->conn)
+			conn = entry->conn;
+	}
+
+	if (!conn && IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, true);
+
+	/* Proceed with resolution if we got a connection, else return false */
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%.*s'",
+							is_commit ? "COMMIT" : "ROLLBACK",
+							prep_info_len, prep_info);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+			/*
+			 * The command failed; raise a warning to log the reason for the
+			 * failure. We may not be in a transaction here, so raising an error
+			 * doesn't help. Even if we are in a transaction, it would be the
+			 * resolver transaction, which would be aborted by an error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+
+			if (diag_sqlstate)
+			{
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
+			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
+		}
+		else
+			result = true;
+
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+	else
+		return false;
+}
+
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry)
+{
+	/*
+	 * If there were any errors in subtransactions, and we made prepared
+	 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+	 * prepared statements. This is annoying and not terribly bulletproof,
+	 * but it's probably not worth trying harder.
+	 *
+	 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+	 * old a server postgres_fdw can communicate with.	We intentionally
+	 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+	 * extent with older servers (leaking prepared statements as we go;
+	 * but we don't really support update operations pre-8.3 anyway).
+	 */
+	if (entry->have_prep_stmt && entry->have_error)
+	{
+		PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+		PQclear(res);
+	}
+
+	entry->have_prep_stmt = false;
+	entry->have_error = false;
+	/* Reset state to show we're out of a transaction */
+	entry->xact_depth = 0;
+
+	/*
+	 * If the connection isn't in a good idle state, discard it to
+	 * recover. Next GetConnection will open a new connection.
+	 */
+	if (PQstatus(entry->conn) != CONNECTION_OK ||
+		PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+	{
+		elog(DEBUG3, "discarding connection %p", entry->conn);
+		PQfinish(entry->conn);
+		entry->conn = NULL;
+	}
+
+	/*
+	 * TODO: these next two statements should be moved to the end-of-transaction
+	 * callback.
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
+
+/*
  * pgfdw_xact_callback --- cleanup at main-transaction end.
  */
 static void
 pgfdw_xact_callback(XactEvent event, void *arg)
 {
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
-
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
-
-	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
-	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
-	{
-		PGresult   *res;
-
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
-
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			bool		abort_cleanup_failure = false;
-
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
-
-			switch (event)
-			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-
-					/*
-					 * If abort cleanup previously failed for this connection,
-					 * we can't issue any more commands against it.
-					 */
-					pgfdw_reject_incomplete_xact_state_change(entry);
-
-					/* Commit all remote transactions during pre-commit */
-					entry->changing_xact_state = true;
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-					entry->changing_xact_state = false;
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-
-					/*
-					 * Don't try to clean up the connection if we're already
-					 * in error recursion trouble.
-					 */
-					if (in_error_recursion_trouble())
-						entry->changing_xact_state = true;
-
-					/*
-					 * If connection is already unsalvageable, don't touch it
-					 * further.
-					 */
-					if (entry->changing_xact_state)
-						break;
-
-					/*
-					 * Mark this connection as in the process of changing
-					 * transaction state.
-					 */
-					entry->changing_xact_state = true;
-
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE &&
-						!pgfdw_cancel_query(entry->conn))
-					{
-						/* Unable to cancel running query. */
-						abort_cleanup_failure = true;
-					}
-					else if (!pgfdw_exec_cleanup_query(entry->conn,
-													   "ABORT TRANSACTION",
-													   false))
-					{
-						/* Unable to abort remote transaction. */
-						abort_cleanup_failure = true;
-					}
-					else if (entry->have_prep_stmt && entry->have_error &&
-							 !pgfdw_exec_cleanup_query(entry->conn,
-													   "DEALLOCATE ALL",
-													   true))
-					{
-						/* Trouble clearing prepared statements. */
-						abort_cleanup_failure = true;
-					}
-					else
-					{
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-
-					/* Disarm changing_xact_state if it all worked. */
-					entry->changing_xact_state = abort_cleanup_failure;
-					break;
-			}
-		}
-
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
-
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-			entry->changing_xact_state)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			disconnect_pg_server(entry);
-		}
-	}
-
 	/*
 	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
+	 * transaction.
 	 */
 	xact_got_connection = false;
 
@@ -1193,3 +1351,25 @@ exit:	;
 		*result = last_res;
 	return timed_out;
 }
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+static bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c19b331..b5d0f13 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -78,10 +91,13 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +140,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -188,8 +213,11 @@ ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
  public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
  public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
  public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7        | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8        | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9        | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
  public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+(9 rows)
 
 -- Test that alteration of server options causes reconnection
 -- Remote's errors might be non-English, so hide them to ensure stable results
@@ -7346,3 +7374,129 @@ AND ftoptions @> array['fetch_size=60000'];
 (1 row)
 
 ROLLBACK;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   101
+(1 row)
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   103
+(1 row)
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   105
+(1 row)
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+ERROR:  duplicate key value violates unique constraint "t5_pkey"
+DETAIL:  Key (c1)=(110) already exists.
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+ERROR:  duplicate key value violates unique constraint "t6_pkey"
+DETAIL:  Key (c1)=(3) already exists.
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+ count 
+-------
+   108
+(1 row)
+
+SELECT COUNT(*) FROM "S 1"."T 6";
+ count 
+-------
+     1
+(1 row)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 67e1c59..67e1127 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/pg_fdw.conf b/contrib/postgres_fdw/pg_fdw.conf
new file mode 100644
index 0000000..b086227
--- /dev/null
+++ b/contrib/postgres_fdw/pg_fdw.conf
@@ -0,0 +1,2 @@
+max_prepared_foreign_transactions = 100
+max_prepared_transactions = 10
diff --git a/contrib/postgres_fdw/pgfdw.conf b/contrib/postgres_fdw/pgfdw.conf
new file mode 100644
index 0000000..2184040
--- /dev/null
+++ b/contrib/postgres_fdw/pgfdw.conf
@@ -0,0 +1,2 @@
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..988f4c6 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdw_xact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -469,6 +471,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -1322,7 +1330,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1700,7 +1708,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2303,7 +2311,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2564,7 +2572,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								&retrieved_attrs, NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3501,7 +3509,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3591,7 +3599,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3814,7 +3822,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..0213fdb 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdw_xact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -115,7 +116,8 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -177,6 +179,15 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
 extern const char *get_jointype_name(JoinType jointype);
+extern char	*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, int prep_info_len,
+											  char *prep_info);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  int prep_info_len, char *prep_info);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid, bool is_commit);
+
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 5f65d9d..68daa86 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,15 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL,
+       CONSTRAINT t5_pkey PRIMARY KEY (c1)
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -83,11 +97,14 @@ INSERT INTO "S 1"."T 4"
 	       'AAA' || to_char(id, 'FM000')
 	FROM generate_series(1, 100) id;
 DELETE FROM "S 1"."T 4" WHERE c1 % 3 != 0;	-- delete for outer join tests
+INSERT INTO "S 1"."T 5"
+	SELECT generate_series(1, 100);
 
 ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +153,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7 (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8 (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9 (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1764,3 +1794,76 @@ WHERE ftrelid = 'table30000'::regclass
 AND ftoptions @> array['fetch_size=60000'];
 
 ROLLBACK;
+
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+
+-- one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(101);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- One server supporting 2PC and another one server not supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(102);
+INSERT INTO ft8 VALUES(103);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Two supporting server.
+BEGIN;
+INSERT INTO ft8 VALUES(105);
+INSERT INTO ft9 VALUES(106);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- Local changes and two servers supporting 2PC.
+BEGIN;
+INSERT INTO ft7 VALUES(110);
+INSERT INTO ft8 VALUES(111);
+INSERT INTO ft9 VALUES(112);
+INSERT INTO "S 1"."T 6" VALUES (3);
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- transaction updating on single supporting foreign server with violation on foreign server.
+BEGIN;
+INSERT INTO ft8 VALUES(113);
+INSERT INTO ft8 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction updating on single supporting foreign server and local with violation on local.
+BEGIN;
+INSERT INTO ft8 VALUES(114);
+INSERT INTO "S 1"."T 6" VALUES (4);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
+
+-- violation on foreign server supporting 2PC.
+BEGIN;
+INSERT INTO ft8 VALUES(115);
+INSERT INTO ft9 VALUES(116);
+INSERT INTO ft9 VALUES(110); -- violation on foreign server
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+
+-- transaction involing local and foreign server with violation on local server.
+BEGIN;
+INSERT INTO ft8 VALUES(117);
+INSERT INTO ft9 VALUES(118);
+INSERT INTO "S 1"."T 6" VALUES (3); -- violation on local
+COMMIT;
+SELECT COUNT(*) FROM ft8;
+SELECT COUNT(*) FROM "S 1"."T 6";
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4b265d9..7ddbf67 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1451,6 +1451,27 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously.
+        This parameter can only be set at server start.
+       </para>
+
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml
index f32b8a8..29a496b 100644
--- a/doc/src/sgml/contrib.sgml
+++ b/doc/src/sgml/contrib.sgml
@@ -115,6 +115,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged;
  &dict-int;
  &dict-xsyn;
  &earthdistance;
+ &fdw-transaction-resolver;
  &file-fdw;
  &fuzzystrmatch;
  &hstore;
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index a59e03a..f1e9a4f 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1736,5 +1736,92 @@ GetForeignServerByName(const char *name, bool missing_ok);
     </para>
 
   </sect1>
+   <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</> transaction manager allows FDWs to read and write
+    data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts
+    a transaction on a foreign server as part of a <productname>PostgreSQL</>
+    transaction, it must call <function>RegisterXactForeignServer</> to register
+    the foreign server together with the <productname>PostgreSQL</> user whose
+    user mapping is used to connect to that server.
+<programlisting>
+void
+RegisterXactForeignServer(Oid serverid,
+                            Oid userid,
+                            bool two_phase_compliant)
+</programlisting>
+    <varname>two_phase_compliant</> should be true if the foreign server supports
+    the two-phase commit protocol, and false otherwise.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may use
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is greater than zero,
+    <productname>PostgreSQL</> employs the two-phase commit protocol to achieve
+    atomic distributed transactions. All the registered foreign servers should
+    support the two-phase commit protocol. The protocol is used when two or more
+    foreign servers that support two-phase commit are involved in the
+    transaction, or when the transaction involves one foreign server that
+    supports two-phase commit and also changes local data. In other cases, for
+    example when only one foreign server that supports two-phase commit is
+    involved and no local data is changed, the two-phase commit protocol is not
+    used. The two-phase commit protocol proceeds in two phases: a prepare phase
+    and a commit phase. In the prepare phase,
+    <productname>PostgreSQL</> prepares the transactions on all the foreign
+    servers registered using <function>RegisterXactForeignServer</>. If any of
+    the foreign servers fails to prepare its transaction, the prepare phase fails.
+    In the commit phase, all the prepared transactions are committed if the prepare
+    phase succeeded, or rolled back if the prepare phase failed on any of the
+    foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase, the distributed transaction manager calls
+    <function>GetPrepareId</> to get the prepared transaction identifier for
+    each foreign server involved. It stores this identifier along with the
+    serverid and userid for later use. It then calls
+    <function>PrepareForeignTransaction</> with the same identifier
+    to prepare the transaction on that foreign server.
+    </para>
+    
+    <para>
+    During the commit phase, the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</> with the same identifier and
+    action FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction, or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be retried
+    through the built-in <function>pg_fdw_xact</>. One may set up a background worker
+    process to retry the operation by installing the extension fdw_transaction_resolver
+    and including $libdir/fdw_transaction_resolver.so in
+    <varname>shared_preload_libraries</>.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</> is zero, atomic commit cannot
+    be guaranteed across foreign servers. If the transaction on <productname>PostgreSQL</>
+    is committed, the distributed transaction manager commits the transaction on all the
+    foreign servers registered using <function>RegisterXactForeignServer</>,
+    independent of the outcome of the same operation on other foreign servers.
+    Thus transactions on some foreign servers may be committed, while those
+    on other foreign servers may be rolled back. If the transaction on
+    <productname>PostgreSQL</> aborts, transactions on all the foreign servers
+    are aborted too.
+    </para>
+    </sect1>
 
  </chapter>
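
The two-phase flow described in the fdwhandler.sgml text above can be sketched as a small standalone C simulation. This is only an illustration under stated assumptions: the Participant type, the P_* states and every function name below are hypothetical and do not appear in the patch, and nothing here talks to a real server. It shows the rule for when two-phase commit is needed (mirroring the condition in PreCommit_FDWXacts) and the prepare-then-resolve sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical participant state; not a structure from the patch. */
typedef enum { P_IDLE, P_PREPARED, P_COMMITTED, P_ABORTED } PartState;

typedef struct
{
    bool      can_prepare;  /* simulates whether PREPARE succeeds remotely */
    PartState state;
} Participant;

/*
 * Two-phase commit is needed with two or more 2PC-capable servers, or with
 * one such server plus local writes.
 */
static bool
needs_two_phase(int n_capable_servers, bool wrote_local)
{
    return n_capable_servers > 1 ||
           (n_capable_servers == 1 && wrote_local);
}

/* Phase 1: prepare on every participant, stopping at the first failure. */
static bool
prepare_all(Participant *p, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (!p[i].can_prepare)
            return false;
        p[i].state = P_PREPARED;
    }
    return true;
}

/* Phase 2: commit everything if phase 1 succeeded, otherwise abort everything. */
static void
resolve_all(Participant *p, size_t n, bool prepared_ok)
{
    for (size_t i = 0; i < n; i++)
    {
        if (prepared_ok)
        {
            assert(p[i].state == P_PREPARED);
            p[i].state = P_COMMITTED;
        }
        else
            p[i].state = P_ABORTED;
    }
}

/* Run both phases; returns true when the distributed commit succeeded. */
static bool
run_two_phase(Participant *p, size_t n)
{
    bool ok = prepare_all(p, n);

    resolve_all(p, n, ok);
    return ok;
}
```

The sketch leaves out what makes the real problem hard: a connection failure between the two phases leaves a prepared transaction unresolved, which is why the patch persists enough information to resolve it later.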
diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml
index bd371fd..edd86cf 100644
--- a/doc/src/sgml/filelist.sgml
+++ b/doc/src/sgml/filelist.sgml
@@ -117,6 +117,7 @@
 <!ENTITY dict-xsyn       SYSTEM "dict-xsyn.sgml">
 <!ENTITY dummy-seclabel  SYSTEM "dummy-seclabel.sgml">
 <!ENTITY earthdistance   SYSTEM "earthdistance.sgml">
+<!ENTITY fdw-transaction-resolver SYSTEM "fdw-transaction-resolver.sgml">
 <!ENTITY file-fdw        SYSTEM "file-fdw.sgml">
 <!ENTITY fuzzystrmatch   SYSTEM "fuzzystrmatch.sgml">
 <!ENTITY hstore          SYSTEM "hstore.sgml">
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index d83fc9e..3ce77a3 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -436,6 +436,42 @@
    </para>
 
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted independently.
+    Some of these transactions may fail to commit on their remote server while
+    others commit successfully. This behavior may be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> may
+       use two-phase commit when the transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to a
+       nonzero value on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to a nonzero value on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
  </sect2>
 
  <sect2>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..6e23ec1 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,10 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdw_xactdesc.o \
+	   genericdesc.o gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	   logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	   seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o \
+	   xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdw_xactdesc.c b/src/backend/access/rmgrdesc/fdw_xactdesc.c
new file mode 100644
index 0000000..fd4957c
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdw_xactdesc.c
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdw_xactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdw_xact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+extern void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FDWXactOnDiskData *fdw_insert_xlog = (FDWXactOnDiskData *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+
+		/*
+		 * TODO: we also need to assess whether we want to add this
+		 * information
+		 */
+		appendStringInfo(buf, " foreign transaction info: ");
+		appendStringInfo(buf, "%.*s", fdw_insert_xlog->fdw_xact_id_len,
+						 fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+extern const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
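
As a rough standalone illustration of the dispatch in fdw_xact_identify() above: the low bits of the info byte are masked off before the record type is compared. The SKETCH_* constants and both function names are hypothetical. The mask mirrors PostgreSQL's XLR_INFO_MASK (0x0F), but the two record-type values are assumptions, since the patch defines them in a header that is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SKETCH_XLR_INFO_MASK   0x0F  /* low bits reserved for the WAL machinery */
#define SKETCH_FDW_XACT_INSERT 0x00  /* assumed value, for illustration only */
#define SKETCH_FDW_XACT_REMOVE 0x10  /* assumed value, for illustration only */

/* Map an info byte to a record name, ignoring the reserved low bits. */
static const char *
sketch_identify(uint8_t info)
{
    switch (info & ~SKETCH_XLR_INFO_MASK)
    {
        case SKETCH_FDW_XACT_INSERT:
            return "NEW FOREIGN TRANSACTION";
        case SKETCH_FDW_XACT_REMOVE:
            return "REMOVE FOREIGN TRANSACTION";
    }
    /* Unknown record type */
    return NULL;
}

/* Write the record name, or "UNKNOWN", into buf; returns snprintf's result. */
static int
sketch_describe(uint8_t info, char *buf, size_t len)
{
    const char *name = sketch_identify(info);

    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s", name ? name : "UNKNOWN");
}
```

Callers must be prepared for a NULL return on unrecognized record types, as with fdw_xact_identify().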
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f72f076..d35d8e3 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,15 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s max_fdw_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..dd7ee32 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o fdw_xact.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/fdw_xact.c b/src/backend/access/transam/fdw_xact.c
new file mode 100644
index 0000000..2b35f5f
--- /dev/null
+++ b/src/backend/access/transam/fdw_xact.c
@@ -0,0 +1,2205 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdw_xact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdw_xact.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdw_xact.h"
+#include "access/htup_details.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/snapmgr.h"
+
+/*
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using RegisterXactForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases:
+ * First phase (executed during pre-commit processing)
+ * -----------
+ * Transactions are prepared on all the foreign servers that can participate
+ * in the two-phase commit protocol. Transactions on other foreign servers are
+ * committed in the same phase.
+ *
+ * Second phase (executed during post-commit/abort processing)
+ * ------------
+ * If the first phase succeeds, foreign servers are requested to commit their
+ * prepared transactions. If the first phase does not succeed because of any
+ * failure, the foreign servers are asked to roll back their prepared
+ * transactions, or to abort the transactions if they are not prepared.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved. During the first phase, before
+ * actually preparing the transactions, enough information is persisted to disk
+ * and to the logs in order to resolve such transactions.
+ *
+ * During replay and replication FDWXactGlobal also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdw_xact records happens by the following rules:
+ *
+ *		* On PREPARE redo we add the foreign transaction to
+ *		  FDWXactGlobal->fdw_xacts. We set fdw_xact->inredo to true for
+ *		  such entries.
+ *
+ *		* On Checkpoint redo we iterate through the FDWXactGlobal->fdw_xacts
+ *		  entries that have fdw_xact->inredo set and are behind the redo_horizon.
+ *		  We save them to disk and also set fdw_xact->ondisk to true.
+ *
+ *		* On COMMIT/ABORT we delete the entry from FDWXactGlobal->fdw_xacts.
+ *		  If fdw_xact->ondisk is true, we delete the corresponding entry from
+ *		  the disk as well.
+ *
+ *		* RecoverPreparedTransactions(), StandbyRecoverPreparedTransactions() and
+ *		  PrescanPreparedTransactions() have been modified to go through
+ *		  fdw_xact->inredo entries that have not made it to disk yet.
+ */
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FDWXactData *FDWXact;
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid			serverid;
+	Oid			userid;
+	Oid			umid;
+	char	   *servername;
+	FDWXact		fdw_xact;		/* foreign prepared transaction entry in case
+								 * prepared */
+	bool		two_phase_commit;		/* Should use two phase commit
+										 * protocol while committing
+										 * transaction on this server,
+										 * whenever necessary. */
+	GetPrepareId_function get_prepare_id;
+	EndForeignTransaction_function end_foreign_xact;
+	PrepareForeignTransaction_function prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
+}	FDWConnection;
+
+/* List of foreign connections participating in the transaction */
+List	   *MyFDWConnections = NIL;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool		TwoPhaseReady = true;
+
+/* Record the server, userid participating in the transaction. */
+void
+RegisterXactForeignServer(Oid serverid, Oid userid, bool two_phase_commit)
+{
+	FDWConnection *fdw_conn;
+	ListCell   *lcell;
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	UserMapping *user_mapping;
+	FdwRoutine *fdw_routine;
+	MemoryContext old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check if the entry already exists, if so, raise an error */
+	foreach(lcell, MyFDWConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		if (fdw_conn->serverid == serverid &&
+			fdw_conn->userid == userid)
+			ereport(ERROR,
+			(errmsg("attempt to start transaction again on server %u user %u",
+					serverid, userid)));
+	}
+
+	/*
+	 * This list and its contents need to be saved in the transaction context
+	 * memory
+	 */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FDWConnection *) palloc(sizeof(FDWConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		ereport(ERROR,
+				(errmsg("no function to end a foreign transaction provided for FDW %s",
+						fdw->fdwname)));
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (!fdw_routine->GetPrepareId)
+			ereport(ERROR,
+					(errmsg("no prepared transaction identifier providing function for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for preparing foreign transaction for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for resolving prepared foreign transaction for FDW %s",
+							fdw->fdwname)));
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need following information at the end of a transaction, when the
+	 * system caches are not available, so save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFDWConnections = lappend(MyFDWConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/* Prepared transaction identifier can be at most 256 bytes long */
+#define MAX_FDW_XACT_ID_LEN 256
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,		/* foreign prepared transaction is to
+										 * be committed */
+	FDW_XACT_ABORTING_PREPARED, /* foreign prepared transaction is to be
+								 * aborted */
+	FDW_XACT_RESOLVED			/* Status used only by pg_fdw_xact_resolve().
+								 * It doesn't appear in the in-memory entry. */
+}	FDWXactStatus;
+
+typedef struct FDWXactData
+{
+	FDWXact		fx_next;		/* Next free FDWXact entry */
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;	/* XID of local transaction */
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping id for connection key */
+	FDWXactStatus status;		/* The state of the foreign
+								 * transaction. This doubles as the
+								 * action to be taken on this entry. */
+
+	/*
+	 * Note that we need to keep track of two LSNs for each FDWXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FDWXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior to
+	 * commit.
+	 */
+	XLogRecPtr	fdw_xact_start_lsn;		/* XLOG offset of inserting this entry
+										 * start */
+	XLogRecPtr	fdw_xact_end_lsn;		/* XLOG offset of inserting this entry
+										 * end */
+
+	bool		valid; /* Has the entry been completed and written to file? */
+	BackendId	locking_backend;	/* Backend working on this entry */
+	bool		ondisk;			/* TRUE if prepare state file is on disk */
+	bool		inredo;			/* TRUE if entry was added via xlog_redo */
+	int			fdw_xact_id_len;	/* Length of prepared transaction identifier */
+	char		fdw_xact_id[MAX_FDW_XACT_ID_LEN];	/* prepared transaction id */
+}	FDWXactData;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FDWXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+							serverid, userid)
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FDWXactData structs */
+	FDWXact		freeFDWXacts;
+
+	/* Number of valid FDW transaction entries */
+	int			numFDWXacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FDWXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];		/* Variable length array */
+}	FDWXactGlobalData;
+
+static void AtProcExit_FDWXact(int code, Datum arg);
+static bool resolve_fdw_xact(FDWXact fdw_xact,
+  ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static FDWXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, int fdw_xact_id_len, char *fdw_xact_id);
+static void unlock_fdw_xact(FDWXact fdw_xact);
+static void unlock_fdw_xact_entries(void);
+static void remove_fdw_xact(FDWXact fdw_xact);
+static FDWXact register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+								 Oid umid, int fdw_xact_info_len, char *fdw_xact_info);
+static int	GetFDWXactList(FDWXact * fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FDWXact fdw_xact);
+static FDWXactOnDiskData *ReadFDWXactFile(TransactionId xid, Oid serverid,
+				Oid userid);
+static void RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+				  bool giveWarning);
+static void RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len);
+static void XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len);
+static void prepare_foreign_transactions(void);
+static FDWXact get_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
+bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time.
+ * GUC variable; changing it requires a restart.
+ */
+int			max_prepared_foreign_xacts = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+static FDWXactGlobalData *FDWXactGlobal;
+
+/* foreign transaction entries locked by this backend */
+List	   *MyLockedFDWXacts = NIL;
+
+/*
+ * FDWXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+extern Size
+FDWXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FDWXactGlobalData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FDWXactData)));
+
+	return size;
+}
+
+/*
+ * FDWXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FDWXactGlobalData structure.
+ */
+extern void
+FDWXactShmemInit(void)
+{
+	bool		found;
+
+	FDWXactGlobal = ShmemInitStruct("Foreign transactions table",
+									FDWXactShmemSize(),
+									&found);
+	if (!IsUnderPostmaster)
+	{
+		FDWXact		fdw_xacts;
+		int			cnt;
+
+		Assert(!found);
+		FDWXactGlobal->freeFDWXacts = NULL;
+		FDWXactGlobal->numFDWXacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FDWXact)
+			((char *) FDWXactGlobal +
+			 MAXALIGN(offsetof(FDWXactGlobalData, fdw_xacts) +
+					  sizeof(FDWXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = &fdw_xacts[cnt];
+		}
+	}
+	else
+	{
+		Assert(FDWXactGlobal);
+		Assert(found);
+	}
+}
+
+/*
+ * PreCommit_FDWXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Basically, transactions are prepared on the foreign servers that can execute
+ * the two-phase commit protocol. However, when only one such server is involved
+ * in the transaction and no changes are made to local data, we don't need the
+ * two-phase commit protocol, so we simply try to commit the transaction on that
+ * server. Prepared transactions will be committed or aborted after the current
+ * transaction has been committed or aborted, respectively. We try to
+ * commit transactions on the rest of the foreign servers now. For these foreign
+ * servers it is possible that some transactions commit even if the local
+ * transaction aborts.
+ */
+void
+PreCommit_FDWXacts(void)
+{
+	ListCell   *cur;
+	ListCell   *prev;
+	ListCell   *next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not
+	 * execute two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFDWConnections), prev = NULL; cur; cur = next)
+	{
+		FDWConnection *fdw_conn = lfirst(cur);
+
+		next = lnext(cur);
+
+		if (!fdw_conn->two_phase_commit)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+					 fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFDWConnections = list_delete_cell(MyFDWConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute the
+	 * two-phase-commit protocol have already committed, and MyFDWConnections
+	 * contains only foreign servers that can execute it. We don't need to use
+	 * the two-phase-commit protocol if only one such foreign server remains
+	 * and the transaction didn't write to the local node.
+	 */
+	if ((list_length(MyFDWConnections) > 1) ||
+		(list_length(MyFDWConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		prepare_foreign_transactions();
+	}
+	else if (list_length(MyFDWConnections) == 1)
+	{
+		FDWConnection *fdw_conn = lfirst(list_head(MyFDWConnections));
+
+		/*
+		 * Only one server remains and nothing was written locally, so we
+		 * don't need the two-phase-commit protocol even though this server
+		 * can execute it. Commit the transaction on it directly.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Forget the only remaining connection; the list becomes empty */
+		MyFDWConnections = list_delete_first(MyFDWConnections);
+	}
+}
+
+/*
+ * prepare_foreign_transactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+prepare_foreign_transactions(void)
+{
+	ListCell   *lcell;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = (FDWConnection *) lfirst(lcell);
+		char	   *fdw_xact_id;
+		int			fdw_xact_id_len;
+		FDWXact		fdw_xact;
+
+		if (!fdw_conn->two_phase_commit)
+			continue;
+
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+											   fdw_conn->userid,
+											   &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to
+		 * prepare it on the foreign server. Registration logs this
+		 * information to WAL (thereby relaying it to the standby), and a
+		 * later checkpoint persists it to disk. Thus in case we lose
+		 * connectivity to the foreign server or crash ourselves, we will
+		 * remember that we have prepared a transaction on the foreign server
+		 * and try to resolve it when connectivity is restored or after crash
+		 * recovery.
+		 *
+		 * If we crash after persisting the information but before preparing
+		 * the transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long
+		 * as the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before
+		 * persisting the information to the disk and crash in-between these
+		 * two steps, we will forget that we prepared the transaction on the
+		 * foreign server and will not be able to resolve it after the crash.
+		 * Hence persist first then prepare.
+		 */
+		fdw_xact = register_fdw_xact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id_len,
+									 fdw_xact_id);
+
+		/*
+		 * Between the register_fdw_xact call above and the moment this
+		 * backend hears back from the foreign server, the backend may abort
+		 * the local transaction (say, because of a signal). During abort
+		 * processing, it will send an ABORT message to the foreign server. If
+		 * the foreign server has not prepared the transaction, the message
+		 * will succeed. If the foreign server has prepared the transaction,
+		 * it will throw an error, which we will ignore, and the prepared
+		 * foreign transaction will be resolved by the foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id_len,
+											fdw_xact_id))
+		{
+			/*
+			 * An error occurred, and we didn't prepare the transaction.
+			 * Delete the entry from foreign transaction table. Raise an
+			 * error, so that the local server knows that one of the foreign
+			 * servers has failed to prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then
+			 * we raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell pointer,
+			 * but that is fine since we will not use that pointer because the
+			 * subsequent ereport will get us out of this loop.
+			 */
+			MyFDWConnections = list_delete_ptr(MyFDWConnections, fdw_conn);
+			ereport(ERROR,
+				  (errmsg("can not prepare transaction on foreign server %s",
+						  fdw_conn->servername)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+	}
+	return;
+}
+
+/*
+ * register_fdw_xact
+ *
+ * This function is used to create new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function writes the entry
+ * to WAL; the entry is persisted to disk under the pg_fdw_xact directory at
+ * the next checkpoint.
+ */
+static FDWXact
+register_fdw_xact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, int fdw_xact_id_len, char *fdw_xact_id)
+{
+	FDWXact		fdw_xact;
+	FDWXactOnDiskData *fdw_xact_file_data;
+	int			data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid,
+							   fdw_xact_id_len, fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->locking_backend = MyBackendId;
+	LWLockRelease(FDWXactLock);
+
+	/* Remember that we have locked this entry. */
+	MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FDWXactOnDiskData, fdw_xact_id);
+	data_len = data_len + fdw_xact->fdw_xact_id_len;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FDWXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	fdw_xact_file_data->fdw_xact_id_len = fdw_xact->fdw_xact_id_len;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+		   fdw_xact->fdw_xact_id_len);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *) fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* WAL record is flushed; checkpoint can now write the state file */
+	fdw_xact->valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. Caller must hold
+ * FDWXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FDWXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				int fdw_xact_id_len, char *fdw_xact_id)
+{
+	int i;
+	FDWXact fdw_xact;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FDWXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	if (fdw_xact_id_len > MAX_FDW_XACT_ID_LEN)
+		elog(ERROR, "foreign transaction identifier longer (%d) than allowed (%d)",
+			 fdw_xact_id_len, MAX_FDW_XACT_ID_LEN);
+
+	/* Check for a duplicate foreign transaction entry */
+	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[i];
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+				 xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FDWXactGlobal->freeFDWXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+		errhint("Increase max_prepared_foreign_transactions (currently %d).",
+				max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FDWXactGlobal->freeFDWXacts;
+	FDWXactGlobal->freeFDWXacts = fdw_xact->fx_next;
+
+	/* Insert the entry to active array */
+	Assert(FDWXactGlobal->numFDWXacts < max_prepared_foreign_xacts);
+	FDWXactGlobal->fdw_xacts[FDWXactGlobal->numFDWXacts++] = fdw_xact;
+
+	/* Initialise the entry; the caller stamps it with its backend id */
+	fdw_xact->locking_backend = InvalidBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->inredo = false;
+	fdw_xact->fdw_xact_id_len = fdw_xact_id_len;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, fdw_xact_id_len);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Remove the foreign prepared transaction entry from shared memory, log the
+ * removal in WAL, and remove the on-disk file if it exists.
+ */
+static void
+remove_fdw_xact(FDWXact fdw_xact)
+{
+	int			cnt;
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		if (FDWXactGlobal->fdw_xacts[cnt] == fdw_xact)
+		{
+			/* Remove the entry from active array */
+			FDWXactGlobal->numFDWXacts--;
+			FDWXactGlobal->fdw_xacts[cnt] = FDWXactGlobal->fdw_xacts[FDWXactGlobal->numFDWXacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_next = FDWXactGlobal->freeFDWXacts;
+			FDWXactGlobal->freeFDWXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+
+			LWLockRelease(FDWXactLock);
+
+			if (!RecoveryInProgress())
+			{
+				FdwRemoveXlogRec fdw_remove_xlog;
+				XLogRecPtr	recptr;
+
+				/* Fill up the log record before releasing the entry */
+				fdw_remove_xlog.serverid = fdw_xact->serverid;
+				fdw_remove_xlog.dbid = fdw_xact->dboid;
+				fdw_remove_xlog.xid = fdw_xact->local_xid;
+				fdw_remove_xlog.userid = fdw_xact->userid;
+
+				START_CRIT_SECTION();
+
+				/*
+				 * Log that we are removing the foreign transaction entry and
+				 * remove the file from the disk as well.
+				 */
+				XLogBeginInsert();
+				XLogRegisterData((char *) &fdw_remove_xlog, sizeof(fdw_remove_xlog));
+				recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+				XLogFlush(recptr);
+
+				END_CRIT_SECTION();
+			}
+
+			/* Remove the file from the disk if it exists. */
+			if (fdw_xact->ondisk)
+				RemoveFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								  fdw_xact->userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FDWXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FDWXactGlobal array", fdw_xact);
+}
+
+/*
+ * unlock_fdw_xact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transactions.
+ */
+static void
+unlock_fdw_xact(FDWXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse
+	 * the order and the process exits in between those two, we will be left
+	 * with an entry locked by this backend, which gets unlocked only at the
+	 * server restart.
+	 */
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFDWXacts = list_delete_ptr(MyLockedFDWXacts, fdw_xact);
+	LWLockRelease(FDWXactLock);
+}
+
+/*
+ * unlock_fdw_xact_entries
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+unlock_fdw_xact_entries(void)
+{
+	while (MyLockedFDWXacts)
+	{
+		FDWXact		fdw_xact = (FDWXact) linitial(MyLockedFDWXacts);
+
+		unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * AtProcExit_FDWXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FDWXact(int code, Datum arg)
+{
+	unlock_fdw_xact_entries();
+}
+
+/*
+ * AtEOXact_FDWXacts
+ *
+ * The function executes phase 2 of two-phase commit protocol.
+ * At the end of transaction perform following actions
+ * 1. Mark the entries locked by this backend as ABORTING or COMMITTING
+ *	  according to the result of the transaction.
+ * 2. Try to commit or abort the transactions on foreign servers. If that
+ *	  succeeds, remove them from foreign transaction entries, otherwise unlock
+ *	  them.
+ */
+extern void
+AtEOXact_FDWXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	foreach(lcell, MyFDWConnections)
+	{
+		FDWConnection *fdw_conn = lfirst(lcell);
+
+		/* Commit/abort prepared foreign transactions */
+		if (fdw_conn->fdw_xact)
+		{
+			FDWXact		fdw_xact = fdw_conn->fdw_xact;
+
+			fdw_xact->status = (is_commit ?
+										 FDW_XACT_COMMITTING_PREPARED :
+										 FDW_XACT_ABORTING_PREPARED);
+
+			/*
+			 * Try aborting or committing the transaction on the foreign
+			 * server
+			 */
+			if (!resolve_fdw_xact(fdw_xact, fdw_conn->resolve_prepared_foreign_xact))
+			{
+				/*
+				 * The transaction was not resolved on the foreign server,
+				 * unlock it, so that someone else can take care of it.
+				 */
+				unlock_fdw_xact(fdw_xact);
+			}
+		}
+		else
+		{
+			/*
+			 * On servers where two phase commit protocol could not be
+			 * executed we have tried to commit the transactions during
+			 * pre-commit phase. Any remaining transactions need to be
+			 * aborted.
+			 */
+			Assert(!is_commit);
+
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, is_commit))
+				elog(WARNING, "could not %s transaction on server %s",
+					 is_commit ? "commit" : "abort",
+					 fdw_conn->servername);
+
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Resolver might lock the
+	 * entries, and may not be able to unlock them if aborted in-between. In
+	 * any case, there is no reason for a foreign transaction entry to be
+	 * locked after the transaction which locked it has ended.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FDWXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ */
+extern void
+AtPrepare_FDWXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFDWConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared
+	 * should be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	prepare_foreign_transactions();
+
+	/*
+	 * Unlock the foreign transaction entries so COMMIT/ROLLBACK PREPARED from
+	 * some other backend will be able to lock those if required.
+	 */
+	unlock_fdw_xact_entries();
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFDWConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * FDWXactTwoPhaseFinish
+ *
+ * This function is called as part of the COMMIT/ROLLBACK PREPARED command to
+ * commit/rollback the foreign transactions prepared as part of the local
+ * prepared transaction. The function looks for the foreign transaction entries
+ * with local_xid equal to xid of the prepared transaction and tries to resolve them.
+ */
+extern void
+FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid)
+{
+	List	   *entries_to_resolve;
+
+	FDWXactStatus status = isCommit ? FDW_XACT_COMMITTING_PREPARED :
+	FDW_XACT_ABORTING_PREPARED;
+
+	/*
+	 * Get all the entries belonging to the given transaction id locked. If
+	 * foreign transaction resolver is running, it might lock entries to check
+	 * whether they can be resolved. The search function will skip such
+	 * entries. The resolver will resolve them at a later point of time.
+	 */
+	search_fdw_xact(xid, InvalidOid, InvalidOid, InvalidOid, &entries_to_resolve);
+
+	/* Try resolving the foreign transactions */
+	while (entries_to_resolve)
+	{
+		FDWXact		fdw_xact = linitial(entries_to_resolve);
+
+		entries_to_resolve = list_delete_first(entries_to_resolve);
+		fdw_xact->status = status;
+
+		/*
+		 * Resolve the foreign transaction. If resolution is not successful,
+		 * unlock the entry so that someone else can pick it up.
+		 */
+		if (!resolve_fdw_xact(fdw_xact,
+							  get_prepared_foreign_xact_resolver(fdw_xact)))
+			unlock_fdw_xact(fdw_xact);
+	}
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FDWXact fdw_xact)
+{
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * resolve_fdw_xact
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+resolve_fdw_xact(FDWXact fdw_xact,
+				 ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool		resolved;
+	bool		is_commit;
+
+	Assert(fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+		   fdw_xact->status == FDW_XACT_ABORTING_PREPARED);
+
+	is_commit = (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED);
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id_len,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * Get the foreign transaction entry from FDWXactGlobal->fdw_xacts. Return NULL
+ * if the given foreign transaction does not exist.
+ */
+static FDWXact
+get_fdw_xact(TransactionId xid, Oid serverid, Oid userid)
+{
+	int i;
+	FDWXact fdw_xact;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	for (i = 0; i < FDWXactGlobal->numFDWXacts; i++)
+	{
+		fdw_xact = FDWXactGlobal->fdw_xacts[i];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			LWLockRelease(FDWXactLock);
+			return fdw_xact;
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+	return NULL;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact; see
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid   | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid   | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid   | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid   | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid   | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid   | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria is void (all arguments invalid) the function returns
+ * true if any entry exists at all, since any entry would match the criteria.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked by other
+ * backends, nothing will be returned in the list but the returned value will
+ * be true.
+ */
+bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FDWXactLock, lock_mode);
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		FDWXact		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+		bool		entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list.
+				 * If the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+
+					fdw_xact->locking_backend = MyBackendId;
+
+					/*
+					 * The list and its members may be required at the end of
+					 * the transaction
+					 */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFDWXacts = lappend(MyLockedFDWXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+extern void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		FDWXactRedoAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		/* Delete FDWXact entry and file if exists */
+		FDWXactRedoRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFDWXact
+ *
+ * Write to disk the state of the foreign transaction entries whose WAL
+ * records end at or before redo_horizon and which have not been written yet.
+ * The foreign transaction entries, and hence the corresponding files, are
+ * expected to be very short-lived. By deferring this work to checkpoint time
+ * we usually have fewer files to write and fsync, thus reducing some I/O.
+ * This is similar to CheckPointTwoPhase().
+ *
+ * For simplicity the I/O is performed while holding FDWXactLock; see the
+ * comment inside the function for the trade-off.
+ */
+void
+CheckPointFDWXact(XLogRecPtr redo_horizon)
+{
+	int			cnt;
+	int			serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	/* Another quick get-away, before we do any work */
+	if (FDWXactGlobal->numFDWXacts <= 0)
+	{
+		LWLockRelease(FDWXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FDWXact that need to be copied to
+	 * disk, so we perform all I/O while holding FDWXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move the I/O out of the lock, but then on every
+	 * error we would have to check whether somebody committed our
+	 * transaction in a different backend. Let's leave this optimisation for
+	 * the future, in case somebody finds that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FDWXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked
+	 * invalid, because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FDWXactGlobal->numFDWXacts; cnt++)
+	{
+		FDWXact		fdw_xact = FDWXactGlobal->fdw_xacts[cnt];
+
+		if ((fdw_xact->valid || fdw_xact->inredo) &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFDWXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFDWXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FDWXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+			  (errmsg_plural("%u foreign transaction state file was written "
+							 "for long-running prepared transactions",
+							 "%u foreign transaction state files were written "
+							 "for long-running prepared transactions",
+							 serialized_fdw_xacts,
+							 serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFDWXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFDWXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+		   errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not read foreign transaction state from xlog at %X/%X",
+			   (uint32) (lsn >> 32),
+			   (uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFDWXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY,
+						   S_IRUSR | S_IWUSR);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not recreate foreign transaction state file \"%s\": %m",
+			   path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FDWXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FDWXact		fdw_xacts;
+	int			num_xacts;
+	int			cur_xact;
+}	WorkingStatus;
+
+/*
+ * pg_fdw_xacts
+ *		Produce a view with one row per prepared transaction on foreign server.
+ *
+ * This function is here so we don't have to export the
+ * FDWXactGlobalData struct definition.
+ *
+ */
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFDWXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FDWXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled in yet. Callers should filter them out if not wanted.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFDWXactList(FDWXact * fdw_xacts)
+{
+	int			num_xacts;
+	int			cnt_xacts;
+
+	LWLockAcquire(FDWXactLock, LW_SHARED);
+
+	if (FDWXactGlobal->numFDWXacts == 0)
+	{
+		LWLockRelease(FDWXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FDWXactGlobal->numFDWXacts;
+	*fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FDWXactGlobal->fdw_xacts[cnt_xacts],
+			   sizeof(FDWXactData));
+
+	LWLockRelease(FDWXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * pg_fdw_xact_resolve
+ * a user interface to initiate foreign transaction resolution. The function
+ * tries to resolve the prepared transactions on foreign servers in the database
+ * from where it is run.
+ * The function prints the status of all the foreign transactions it
+ * encountered, whether resolved or not.
+ */
+Datum
+pg_fdw_xact_resolve(PG_FUNCTION_ARGS)
+{
+	MemoryContext oldcontext;
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+	List	   *entries_to_resolve;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+
+		/* We will be modifying the shared memory. Prepare to clean up on exit */
+		if (!fdwXactExitRegistered)
+		{
+			before_shmem_exit(AtProcExit_FDWXact, 0);
+			fdwXactExitRegistered = true;
+		}
+
+		/* Allocate space for and prepare the returning set */
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+		/* Switch to memory context appropriate for multiple function calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+		status->fdw_xacts = (FDWXact) palloc(sizeof(FDWXactData) * FDWXactGlobal->numFDWXacts);
+		status->num_xacts = 0;
+		status->cur_xact = 0;
+
+		/* Done preparation for the result. */
+		MemoryContextSwitchTo(oldcontext);
+
+		/*
+		 * Get entries whose foreign servers are part of the database where
+		 * this function was called. We can get information about only such
+		 * foreign servers. The function will lock the entries. The entries
+		 * which are locked by other backends and whose foreign servers belong
+		 * to this database are left out, since we can not work on those.
+		 */
+		search_fdw_xact(InvalidTransactionId, MyDatabaseId, InvalidOid, InvalidOid,
+						&entries_to_resolve);
+
+		/* Work to resolve the resolvable entries */
+		while (entries_to_resolve)
+		{
+			FDWXact		fdw_xact = linitial(entries_to_resolve);
+
+			/* Remove the entry as we will not use it again */
+			entries_to_resolve = list_delete_first(entries_to_resolve);
+
+			/* Copy the data for the sake of result. */
+			memcpy(status->fdw_xacts + status->num_xacts++,
+				   fdw_xact, sizeof(FDWXactData));
+
+			if (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+				fdw_xact->status == FDW_XACT_ABORTING_PREPARED)
+			{
+				/*
+				 * We have already decided what to do with the foreign
+				 * transaction; nothing more to be done here.
+				 */
+			}
+			else if (TransactionIdDidCommit(fdw_xact->local_xid))
+				fdw_xact->status = FDW_XACT_COMMITTING_PREPARED;
+			else if (TransactionIdDidAbort(fdw_xact->local_xid))
+				fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+			else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+			{
+				/*
+				 * The transaction is neither committed nor aborted, yet no
+				 * backend is running it. It probably crashed before the
+				 * actual commit or abort, so assume that it aborted.
+				 */
+				fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+			}
+			else
+			{
+				/*
+				 * Local transaction is in progress, should not resolve the
+				 * foreign transaction. This can happen when the foreign
+				 * transaction is prepared as part of a local prepared
+				 * transaction. Just continue with the next one.
+				 */
+				unlock_fdw_xact(fdw_xact);
+				continue;
+			}
+
+			/*
+			 * Resolve the foreign transaction. If resolution was not
+			 * successful, unlock the entry so that someone else can pick it
+			 * up
+			 */
+			if (!resolve_fdw_xact(fdw_xact, get_prepared_foreign_xact_resolver(fdw_xact)))
+				unlock_fdw_xact(fdw_xact);
+			else
+				/* Update the status in the result set */
+				status->fdw_xacts[status->num_xacts - 1].status = FDW_XACT_RESOLVED;
+		}
+	}
+
+	/* Print the result set */
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FDWXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "preparing";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			case FDW_XACT_RESOLVED:
+				xact_status = "resolved";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 fdw_xact->fdw_xact_id_len));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId xid;
+	Oid			dbid;
+	Oid			serverid;
+	Oid			userid;
+	List	   *entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+		DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FDWXact		fdw_xact = linitial(entries_to_remove);
+
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFDWXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FDWXactOnDiskData *
+ReadFDWXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+	FDWXactOnDiskData *fdw_xact_file_data;
+	struct stat stat;
+	uint32		crc_offset;
+	pg_crc32c	calc_crc;
+	pg_crc32c	file_crc;
+	char	   *buf;
+
+	FDWXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY, 0);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			   errmsg("could not open FDW transaction state file \"%s\": %m",
+					  path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not stat FDW transaction state file \"%s\": %m",
+					  path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FDWXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("incorrect size of FDW transaction state file \"%s\"",
+						path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FDWXactOnDiskData *) buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not read FDW transaction state file \"%s\": %m",
+					  path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+		fdw_xact_file_data->userid != userid ||
+		fdw_xact_file_data->local_xid != xid)
+	{
+		/* The file was already closed above, so do not close it again. */
+		ereport(WARNING,
+			(errmsg("removing corrupt foreign transaction state file \"%s\"",
+					path)));
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFDWXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFDWXacts(TransactionId oldestActiveXid)
+{
+	TransactionId nextXid = ShmemVariableCache->nextXid;
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding to an
+			 * XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+						(errmsg("removing future foreign prepared transaction file \"%s\"",
+								clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+
+/*
+ * RecoverFDWXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFDWXacts(void)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+			FDWXactOnDiskData *fdw_xact_file_data;
+			FDWXact		fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			fdw_xact_file_data = ReadFDWXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction file \"%s\"",
+						  clde->d_name)));
+				RemoveFDWXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+							local_xid, serverid, userid)));
+
+			fdw_xact = get_fdw_xact(local_xid, serverid, userid);
+
+			LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+			if (!fdw_xact)
+			{
+				/*
+				 * Add this entry into the table of foreign transactions. The
+				 * status of the transaction is set as preparing, since we do not
+				 * know the exact status right now. Resolver will set it later
+				 * based on the status of local transaction which prepared this
+				 * foreign transaction.
+				 */
+				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										   serverid, userid,
+										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id_len,
+										   fdw_xact_file_data->fdw_xact_id);
+				fdw_xact->locking_backend = MyBackendId;
+				fdw_xact->status = FDW_XACT_PREPARING;
+			}
+			else
+			{
+				Assert(fdw_xact->inredo);
+				fdw_xact->inredo = false;
+			}
+
+			/* Mark the entry as ready */
+			fdw_xact->valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			pfree(fdw_xact_file_data);
+			LWLockRelease(FDWXactLock);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFDWXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FDWXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove foreign transaction state file \"%s\": %m",
+							path)));
+}
+
+/*
+ * FDWXactRedoAdd
+ *
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdw_xact entry in shared memory FDWXactData structure.
+ */
+void
+FDWXactRedoAdd(XLogReaderState *record)
+{
+	FDWXactOnDiskData *fdw_xact_data = (FDWXactOnDiskData *) XLogRecGetData(record);
+	FDWXact fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(FDWXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
+							   fdw_xact_data->serverid, fdw_xact_data->userid,
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id_len,
+							   fdw_xact_data->fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+	fdw_xact->inredo = true;
+	LWLockRelease(FDWXactLock);
+}
+/*
+ * FDWXactRedoRemove
+ *
+ * Remove the corresponding fdw_xact entry from FDWXactGlobal.
+ * Also remove fdw_xact file if a foreign transaction was saved
+ * via an earlier checkpoint.
+ */
+void
+FDWXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	FDWXact	fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+	if (fdw_xact)
+	{
+		/* Now we can clean up any files we already left */
+		Assert(fdw_xact->inredo);
+		remove_fdw_xact(fdw_xact);
+	}
+	else
+	{
+		/*
+		 * Entry could be on disk. Call with giveWarning = false
+		 * since it can be expected during replay.
+		 */
+		RemoveFDWXactFile(xid, serverid, userid, false);
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..c10a027 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index cfaf8da..ff268bc 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -1498,6 +1499,12 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 
 	PredicateLockTwoPhaseFinish(xid, isCommit);
 
+	/*
+	 * Commit/Rollback the foreign transactions prepared as part of this
+	 * prepared transaction.
+	 */
+	FDWXactTwoPhaseFinish(isCommit, xid);
+
 	/* Count the prepared xact as committed or aborted */
 	AtEOXact_PgStat(isCommit);
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 52408fc..563333d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -118,6 +119,9 @@ TransactionId *ParallelCurrentXids;
  */
 int			MyXactFlags;
 
+/* True if the current transaction has written on the local node */
+bool		XactWriteLocalNode = false;
+
 /*
  *	transaction states - transaction state from server perspective
  */
@@ -1983,6 +1987,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FDWXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2139,6 +2146,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_FDWXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2160,6 +2168,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2226,6 +2236,9 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	/* Prepare step for foreign transactions */
+	AtPrepare_FDWXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2438,6 +2451,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2620,9 +2635,12 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_FDWXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
+	UnregisterTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4453,6 +4471,32 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember that we wrote on the local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	/* Quick exit if there is no need to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
+	XactWriteLocalNode = true;
+}
+
+/*
+ * UnregisterTransactionLocalNode --- forget that we wrote on the local node
+ */
+void
+UnregisterTransactionLocalNode(void)
+{
+	/* Quick exit if there is no need to remember */
+	if (max_prepared_foreign_xacts == 0)
+		return;
+
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a1..581fae3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5174,6 +5175,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6264,6 +6266,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 	}
 }
 
@@ -6963,7 +6968,10 @@ StartupXLOG(void)
 			InitRecoveryTransactionEnvironment();
 
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7588,6 +7596,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7775,6 +7784,9 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	/* Recover foreign transaction state and insert into shared-memory. */
+	RecoverFDWXacts();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -9101,6 +9113,11 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	/*
+	 * We deliberately delay foreign transaction checkpointing as long as
+	 * possible.
+	 */
+	CheckPointFDWXact(checkPointRedo);
 }
 
 /*
@@ -9538,7 +9555,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9555,6 +9573,7 @@ XLogReportParameters(void)
 			xlrec.MaxConnections = MaxConnections;
 			xlrec.max_worker_processes = max_worker_processes;
 			xlrec.max_prepared_xacts = max_prepared_xacts;
+			xlrec.max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 			xlrec.max_locks_per_xact = max_locks_per_xact;
 			xlrec.wal_level = wal_level;
 			xlrec.wal_log_hints = wal_log_hints;
@@ -9570,6 +9589,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9767,6 +9787,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFDWXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9956,6 +9977,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 8287de9..8dbc347 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -17,6 +17,7 @@
 #include <unistd.h>
 #include <signal.h>
 
+#include "access/fdw_xact.h"
 #include "access/htup_details.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index dc40cde..44c996c 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_fdw_xacts AS
+       SELECT * FROM pg_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 9ad9915..8d6e240 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1093,6 +1094,20 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	/*
+	 * Check if the foreign server has any foreign transaction prepared on it.
+	 * If there is one, and it gets dropped, we will not have any chance to
+	 * resolve that transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm;
+		srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1403,6 +1418,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+							srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 845c409..3a37aa9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -487,6 +487,9 @@ ExecInsert(ModifyTableState *mtstate,
 		if (resultRelationDesc->rd_att->constr || check_partition_constr)
 			ExecConstraints(resultRelInfo, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
 			/* Perform a speculative insertion. */
@@ -750,6 +753,9 @@ ExecDelete(ModifyTableState *mtstate,
 	}
 	else
 	{
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * delete the tuple
 		 *
@@ -1049,6 +1055,9 @@ lreplace:;
 		if (resultRelationDesc->rd_att->constr || resultRelInfo->ri_PartitionCheck)
 			ExecConstraints(resultRelInfo, slot, estate);
 
+		/* Remember that we wrote on the local node, for foreign transactions */
+		RegisterTransactionLocalNode();
+
 		/*
 		 * replace the heap tuple
 		 *
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c..a773b38 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -148,6 +148,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_SPGIST_ID:
 		case RM_BRIN_ID:
 		case RM_COMMIT_TS_ID:
+		case RM_FDW_XACT_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
 			/* just deal with xid, and done */
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..f32db3a 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +151,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FDWXactShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +272,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FDWXactShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..8e7028a 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+FDWXactLock					46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 47a5f25..1f90cda 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdw_xact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2065,6 +2066,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8ba6b1d..2096cd3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,12 @@
 					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
+# Note:  Increasing max_prepared_foreign_transactions costs ~600(?) bytes of shared memory
+# per foreign transaction slot.
+# It is not advisable to set max_prepared_foreign_transactions nonzero unless you
+# actively intend to use atomic foreign transactions feature.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
 #replacement_sort_tuples = 150000	# limits use of replacement selection sort
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 27fcf5a..9404506 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,6 +208,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 8cc4fb0..d8a7065 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -287,6 +287,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 25d5547..168cce8 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -672,6 +672,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -894,6 +895,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..41eed51 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -8,6 +8,7 @@
 #define FRONTEND 1
 #include "postgres.h"
 
+#include "access/fdw_xact.h"
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
diff --git a/src/include/access/fdw_xact.h b/src/include/access/fdw_xact.h
new file mode 100644
index 0000000..69b74af
--- /dev/null
+++ b/src/include/access/fdw_xact.h
@@ -0,0 +1,76 @@
+/*
+ * fdw_xact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdw_xact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "storage/backendid.h"
+#include "foreign/foreign.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+#include "nodes/pg_list.h"
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;
+	uint32		fdw_xact_id_len;/* Length of the value stored in the next
+								 * field */
+	/* This should always be the last member */
+	char		fdw_xact_id[FLEXIBLE_ARRAY_MEMBER];		/* variable length array
+														 * to store foreign
+														 * transaction
+														 * information. */
+}	FDWXactOnDiskData;
+
+typedef struct
+{
+	TransactionId xid;
+	Oid			serverid;
+	Oid			userid;
+	Oid			dbid;
+}	FdwRemoveXlogRec;
+
+extern int	max_prepared_foreign_xacts;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FDWXactShmemSize(void);
+extern void FDWXactShmemInit(void);
+extern void RecoverFDWXacts(void);
+extern TransactionId PrescanFDWXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+extern void AtEOXact_FDWXacts(bool is_commit);
+extern void AtPrepare_FDWXacts(void);
+extern void FDWXactTwoPhaseFinish(bool isCommit, TransactionId xid);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+				Oid userid);
+extern void CheckPointFDWXact(XLogRecPtr redo_horizon);
+extern void RegisterXactForeignServer(Oid serverid, Oid userid, bool can_prepare);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FDWXacts(void);
+extern void FDWXactRedoAdd(XLogReaderState *record);
+extern void FDWXactRedoRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFDWXactRecreateFiles(XLogRecPtr redo_horizon);
+
+#endif   /* FDW_XACT_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 2f43c19..62702de 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index f2c10f9..b8c61a8 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -92,6 +92,9 @@ extern int	MyXactFlags;
 #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK	(1U << 1)
 
 
+/* Foreign transaction support */
+extern bool XactWriteLocalNode;
+
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
  */
@@ -377,6 +380,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void UnregisterTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 22a8e63..54114ae 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -227,6 +227,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3fed3b6..3189eda 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -179,6 +179,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 93c031a..0f370ef 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -5420,6 +5420,12 @@ DATA(insert OID = 3992 ( dense_rank			PGNSP PGUID 12 1 0 2276 0 t f f f f f i s
 DESCR("rank of hypothetical row without gaps");
 DATA(insert OID = 3993 ( dense_rank_final	PGNSP PGUID 12 1 0 2276 0 f f f f f f i s 2 0 20 "2281 2276" "{2281,2276}" "{i,v}" _null_ _null_ _null_ hypothetical_dense_rank_final _null_ _null_ _null_ ));
 DESCR("aggregate final function");
+DATA(insert OID = 4130 ( pg_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4131 ( pg_fdw_xact_resolve	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26, 28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xact_resolve _null_ _null_ _null_ ));
+DESCR("resolve foreign prepared transactions");
+DATA(insert OID = 4132 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
 
 /* pg_upgrade support */
 DATA(insert OID = 3582 ( binary_upgrade_set_next_pg_type_oid PGNSP PGUID  12 1 0 0 0 f f f f t f v r 1 0 2278 "26" _null_ _null_ _null_ _null_ _null_ binary_upgrade_set_next_pg_type_oid _null_ _null_ _null_ ));
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index ef0fbe6..738515d 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -12,6 +12,7 @@
 #ifndef FDWAPI_H
 #define FDWAPI_H
 
+#include "access/fdw_xact.h"
 #include "access/parallel.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
@@ -143,6 +144,24 @@ typedef bool (*AnalyzeForeignTable_function) (Relation relation,
 typedef List *(*ImportForeignSchema_function) (ImportForeignSchemaStmt *stmt,
 											   Oid serverOid);
 
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, int prep_info_len,
+													char *prep_info);
+
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															int prep_info_len,
+															char *prep_info);
+
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+														int *prep_info_len);
+
+
 typedef Size (*EstimateDSMForeignScan_function) (ForeignScanState *node,
 												 ParallelContext *pcxt);
 typedef void (*InitializeDSMForeignScan_function) (ForeignScanState *node,
@@ -223,6 +242,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for foreign transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 205f484..53852de 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -280,11 +280,12 @@ extern PGPROC *PreparedXactProcs;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer, checkpointer and WAL writer run during normal operation.
+ * Background writer, checkpointer, WAL writer and foreign transaction resolver
+ * run during normal operation.
  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
  * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		4
+#define NUM_AUXILIARY_PROCS		5
 
 /* configurable options */
 extern int	DeadlockTimeout;
diff --git a/src/include/utils/builtins.h b/src/include/utils/builtins.h
index 762532f..a563e10 100644
--- a/src/include/utils/builtins.h
+++ b/src/include/utils/builtins.h
@@ -121,4 +121,9 @@ extern int32 type_maximum_size(Oid type_oid, int32 typemod);
 /* quote.c */
 extern char *quote_literal_cstr(const char *rawstr);
 
+/* access/transam/fdw_xact.c */
+extern Datum pg_fdw_xacts(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_resolve(PG_FUNCTION_ARGS);
+extern Datum pg_fdw_xact_remove(PG_FUNCTION_ARGS);
+
 #endif							/* BUILTINS_H */
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 142a1b8..1b28f3c 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -21,4 +21,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/010_fdw_xact.pl b/src/test/recovery/t/010_fdw_xact.pl
new file mode 100644
index 0000000..58bcefd
--- /dev/null
+++ b/src/test/recovery/t/010_fdw_xact.pl
@@ -0,0 +1,186 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 9;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_foreign_transactions = 10
+max_prepared_transactions = 10
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+	has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs2->append_conf('postgresql.conf', "max_prepared_transactions = 10");
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign server
+$node_master->safe_psql('postgres', "CREATE EXTENSION postgres_fdw");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+");
+$node_master->safe_psql('postgres', "
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+");
+
+# Create user mapping
+$node_master->safe_psql('postgres', "
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+");
+
+# Create tables on the foreign servers and import them.
+$node_fs1->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 AS SELECT generate_series(1,10) AS c;
+");
+$node_fs2->safe_psql('postgres', "
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 AS SELECT generate_series(1,10) AS c;
+");
+$node_master->safe_psql('postgres', "
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE local_table (c int);
+INSERT INTO local_table SELECT generate_series(1,10);
+");
+
+# Switch to synchronous replication
+$node_master->psql('postgres', "ALTER SYSTEM SET synchronous_standby_names = '*'");
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving foreign servers.
+# Check if we can commit and roll back transactions involving foreign servers after recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 3 WHERE c = 3;
+UPDATE t2 SET c = 4 WHERE c = 4;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->stop;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after recovery');
+
+#
+# Prepare two transactions involving foreign servers and shut down the master node immediately.
+# Check if we can commit and roll back transactions involving foreign servers after crash recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after crash recovery');
+$result = $node_master->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after crash recovery');
+
+#
+# Commit transactions involving foreign servers and shut down the master node immediately.
+# In this case, information about insertion and deletion of fdw_xact entries exists only in WAL.
+# Check if the fdw_xact entries can be processed properly during recovery.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+COMMIT;
+");
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', 'SELECT count(*) FROM pg_fdw_xacts');
+is($result, 0, "Remove fdw_xact entry during recovery");
+
+#
+# A foreign server goes down after a foreign transaction is prepared but before it is committed.
+# Check that the dangling transaction can be processed properly by the pg_fdw_xact_resolve() function.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 1 WHERE c = 1;
+UPDATE t2 SET c = 2 WHERE c = 2;
+PREPARE TRANSACTION 'gxid1';
+");
+
+$node_fs1->stop;
+
+# Since node_fs1 is down, COMMIT PREPARED will fail on node_fs1.
+$node_master->psql('postgres', "COMMIT PREPARED 'gxid1'");
+
+$node_fs1->start;
+$result = $node_master->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xact_resolve() WHERE status = 'resolved'");
+is($result, 1, "pg_fdw_xact_resolve function");
+
+#
+# Check if the standby node can process prepared foreign transaction after
+# promotion of the standby server.
+#
+$node_master->safe_psql('postgres', "
+BEGIN;
+UPDATE t1 SET c = 5 WHERE c = 5;
+UPDATE t2 SET c = 6 WHERE c = 6;
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+UPDATE t1 SET c = 7 WHERE c = 7;
+UPDATE t2 SET c = 8 WHERE c = 8;
+PREPARE TRANSACTION 'gxid2';
+");
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', "COMMIT PREPARED 'gxid1'");
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', "ROLLBACK PREPARED 'gxid2'");
+is($result, 0, 'Rollback foreign transaction after promotion');
+$result = $node_standby->safe_psql('postgres', "SELECT count(*) FROM pg_fdw_xacts");
+is($result, 0, "Check fdw_xact entry on new master node");
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f1c1b44..6f1da60 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1323,6 +1323,13 @@ pg_cursors| SELECT c.name,
     c.is_scrollable,
     c.creation_time
    FROM pg_cursor() c(name, statement, is_holdable, is_binary, is_scrollable, creation_time);
+pg_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_file_settings| SELECT a.sourcefile,
     a.sourceline,
     a.seqno,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index abb742b..506043c 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2263,9 +2263,11 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts. We also set max_prepared_foreign_transactions to enable testing
+		 * of atomic foreign transactions. (Note: to reduce the probability of
+		 * unexpected shmmax failures, don't set max_prepared_transactions or
+		 * max_prepared_foreign_transactions any higher than actually needed by the
+		 * corresponding regression tests.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2280,7 +2282,8 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{

#151

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Ashutosh Bapat (#149)

Re: Transactions involving multiple postgres foreign servers

On Wed, Sep 27, 2017 at 4:05 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Sep 27, 2017 at 12:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Tue, Sep 26, 2017 at 9:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Sep 26, 2017 at 5:06 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Based on the review comment from Robert, I'm planning a big change to
the architecture of this patch so that a backend process works together
with a dedicated background worker that is responsible for resolving
the foreign transactions. Usage of this feature will be almost the same
as with the current patch, except for a new GUC parameter that controls
the number of resolver processes to launch. That is, we can have
multiple resolver processes to keep latency down.

Multiple resolver processes are useful but get a bit complicated. For
example, if process 1 has a connection open to foreign server A and
process 2 does not, and a request arrives that needs to be handled on
foreign server A, what happens? If process 1 is already busy doing
something else, probably we want process 2 to try to open a new
connection to foreign server A and handle the request. But if process
1 and 2 are both idle, ideally we'd like 1 to get that request rather
than 2. That seems a bit difficult to get working though. Maybe we
should just ignore such considerations in the first version.

Understood. I'll keep it simple in the first version.

While a resolver process is useful for resolving transactions later, it
seems better for performance to try to resolve the prepared foreign
transaction in the post-commit phase, in the same backend which prepared
it, for two reasons: 1. the backend already has a connection to that
foreign server; 2. it has just run some commands to completion on that
foreign server, so it's highly likely that a COMMIT PREPARED would
succeed too. If we let a resolver process do that, we will spend time
in 1. signalling the resolver process and 2. setting up a connection to
the foreign server, and 3. by the time the resolver process tries to
resolve the prepared transaction the foreign server may have become
unavailable, thus delaying the resolution.

I think that letting a resolver process cache its connections to each
foreign server for a while can reduce the overhead of connecting to
foreign servers. These connections will be invalidated by DDLs. Also,
most of the time spent committing a distributed transaction goes to the
interaction between the coordinator and the foreign servers in the
two-phase commit protocol. So I guess the time spent signalling a
resolver process would not be a big overhead.

That said, I agree that the post-commit phase doesn't have a transaction
of its own, and thus catalog lookups and error reporting are not
possible. We will need some different approach here, which may not be
straightforward. So, we may need to delay this optimization for v2. I
think we have discussed this before, but I can't find the mail off-hand.

* Resolver processes
1. Fetch PGPROC entry from the shmem queue and get its XID (say, XID-a).
2. Get the fdw_xact_state entry from shmem hash by XID-a.
3. Iterate fdw_xact entries using the index, and resolve the foreign
transactions.
3-a. If even one foreign transaction failed to resolve, raise an error.
4. Change the waiting backend state to FDWXACT_COMPLETED and release it.

Comments:

- Note that any error we raise here won't reach the user; this is a
background process. We don't want to get into a loop where we just
error out repeatedly forever -- at least not if there's any other
reasonable choice.

Thank you for the comments.

Agreed.

We should probably log an error message in the server log, so that
DBAs are aware of such a failure. Is that something you are
planning to do?

Yes, a resolver process logs an error message in that case.

- I suggest that we ought to track the status for each XID separately
on each server rather than just track the XID status overall. That
way, if transaction resolution fails on one server, we don't keep
trying to reconnect to the others.

Agreed. In the current patch we manage fdw_xact entries that track the
status for each XID separately on each server, and I'm going to use the
same mechanism. The resolver process gets a target XID from the shmem
queue and then gets all the fdw_xact entries associated with that XID
from the fdw_xact array in shmem. But scanning the whole fdw_xact array
could be slow, because the number of entries could be large (e.g.,
max_connections * # of foreign servers). So I'm considering keeping a
linked list of all the fdw_xact entries associated with the same XID,
and a shmem hash pointing to the first fdw_xact entry of the linked
list for each XID. That way, we can find the first target fdw_xact
entry in O(1) and follow the list from there.

If we want to do something like this, would it be useful to use a data
structure similar to what is used for maintaining subtransactions? Just
a thought.

Thank you for the advice, I'll consider that. But all I want to do is
group the fdw_xact entries by XID and fetch the group for a given XID
in O(1), so we might not need a stack-like structure such as the one
used for maintaining subtransactions.
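
As an illustration of the grouping discussed above, here is a minimal,
self-contained C sketch: a small hash keyed by XID points at the first
entry of a per-XID linked list threaded through a fixed array, and each
(XID, server) entry carries its own status. All names (FxTable, fx_add,
fx_first, fx_pending) and sizes are hypothetical and are not taken from
the patch; real code would live in shared memory, need locking, and
support removing entries.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FDWXACTS 64     /* size of the entry array (arbitrary) */
#define HASH_SLOTS   128    /* power of two, larger than MAX_FDWXACTS */
#define NO_ENTRY     (-1)

typedef enum { FX_PREPARED, FX_RESOLVED, FX_FAILED } FxStatus;

/* One foreign transaction: a single (XID, server) pair with its own status. */
typedef struct {
    uint32_t xid;
    uint32_t serverid;
    FxStatus status;
    int      next;          /* next entry with the same XID, or NO_ENTRY */
} FxEntry;

typedef struct {
    FxEntry  entries[MAX_FDWXACTS];
    int      nentries;
    uint32_t slot_xid[HASH_SLOTS];
    int      slot_head[HASH_SLOTS];   /* NO_ENTRY marks an empty slot */
} FxTable;

static void fx_init(FxTable *t)
{
    t->nentries = 0;
    for (int i = 0; i < HASH_SLOTS; i++)
        t->slot_head[i] = NO_ENTRY;
}

/* Linear probing: the slot holding xid, or the empty slot where it belongs. */
static int fx_slot(const FxTable *t, uint32_t xid)
{
    int i = (int) ((xid * 2654435761u) & (HASH_SLOTS - 1));

    while (t->slot_head[i] != NO_ENTRY && t->slot_xid[i] != xid)
        i = (i + 1) & (HASH_SLOTS - 1);
    return i;
}

/* Add an entry and push it onto the front of its XID's list. */
static int fx_add(FxTable *t, uint32_t xid, uint32_t serverid)
{
    assert(t->nentries < MAX_FDWXACTS);

    int slot = fx_slot(t, xid);
    int idx = t->nentries++;

    t->entries[idx] = (FxEntry){ xid, serverid, FX_PREPARED, t->slot_head[slot] };
    t->slot_xid[slot] = xid;
    t->slot_head[slot] = idx;
    return idx;
}

/* Head of the list for xid, or NO_ENTRY; no scan of the whole array. */
static int fx_first(const FxTable *t, uint32_t xid)
{
    return t->slot_head[fx_slot(t, xid)];
}

/* Count entries for xid that still need resolving, skipping resolved ones. */
static int fx_pending(const FxTable *t, uint32_t xid)
{
    int n = 0;

    for (int i = fx_first(t, xid); i != NO_ENTRY; i = t->entries[i].next)
        if (t->entries[i].status != FX_RESOLVED)
            n++;
    return n;
}
```

Because the status is kept per entry, a resolver that fails on one
server can mark the other entries FX_RESOLVED and later revisit only
the one that is still pending.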

- If we go to resolve a remote transaction and find that no such
remote transaction exists, what should we do? I'm inclined to think
that we should regard that as if we had succeeded in resolving the
transaction. Certainly, if we've retried the server repeatedly, it
might be that we previously succeeded in resolving the transaction but
then the network connection was broken before we got the success
message back from the remote server. But even if that's not the
scenario, I think we should assume that the DBA or some other system
resolved it and therefore we don't need to do anything further. If we
assume anything else, then we just go into an infinite error loop,
which isn't useful behavior. We could log a message, though (for
example, LOG: unable to resolve foreign transaction ... because it
does not exist).

Agreed.

Yes. I think the current patch takes care of this, except probably the
error message.
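
The rule agreed on above (a prepared transaction that no longer exists
on the remote side counts as resolved) might be expressed as follows.
This is only a sketch: the enum and function names are made up for
illustration, and it assumes the remote PostgreSQL server reports a
missing prepared transaction with SQLSTATE 42704 (undefined_object),
which should be verified against the server versions in use. A caller
would typically obtain the SQLSTATE from libpq with
PQresultErrorField(res, PG_DIAG_SQLSTATE).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    RESOLVE_DONE,   /* COMMIT/ROLLBACK PREPARED succeeded */
    RESOLVE_GONE,   /* no such prepared transaction: treat as resolved, log it */
    RESOLVE_RETRY   /* anything else: keep the entry and try again later */
} ResolveOutcome;

/* Assumption: a missing prepared transaction is reported as undefined_object. */
#define SQLSTATE_UNDEFINED_OBJECT "42704"

/*
 * Classify the result of a resolution attempt.  In this sketch a NULL
 * sqlstate means the command succeeded without an error.
 */
static ResolveOutcome classify_resolve_result(const char *sqlstate)
{
    assert(sqlstate == NULL || strlen(sqlstate) == 5);

    if (sqlstate == NULL)
        return RESOLVE_DONE;
    if (strcmp(sqlstate, SQLSTATE_UNDEFINED_OBJECT) == 0)
        return RESOLVE_GONE;
    return RESOLVE_RETRY;
}

/* Both "done" and "gone" let the resolver drop the entry and move on. */
static int entry_can_be_removed(ResolveOutcome outcome)
{
    return outcome == RESOLVE_DONE || outcome == RESOLVE_GONE;
}
```

Treating RESOLVE_GONE as removable is what keeps the resolver out of
the infinite error loop described above; only that case would emit the
LOG message.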

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#152

Robert Haas

robertmhaas@gmail.com

over 8 years ago

In reply to: Masahiko Sawada (#151)

Re: Transactions involving multiple postgres foreign servers

On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think that having a resolver process keep a connection cache for each
foreign server for a while can reduce the overhead of connecting to
foreign servers. These connections will be invalidated by DDL. Also,
most of the time spent committing a distributed transaction goes to the
interaction between the coordinator and the foreign servers in the
two-phase commit protocol. So I guess the time spent signalling a
resolver process would not be a big overhead.

I agree. Also, in the future, we might try to allow connections to be
shared across backends. I did some research on this a number of years
ago and found that every operating system I investigated had some way
of passing a file descriptor from one process to another -- so a
shared connection cache might be possible.
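
For readers unfamiliar with the mechanism mentioned above: on Unix-like
systems a descriptor can be passed over a Unix-domain socket as
SCM_RIGHTS ancillary data. The sketch below shows only that OS-level
step; send_fd and recv_fd are illustrative helper names, and nothing
here is PostgreSQL code. Note that handing over the socket alone would
not transfer the libpq connection state held in the sending process's
memory, so a shared connection cache would need more than this.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned for struct cmsghdr, big enough for one fd. */
typedef union {
    char           buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
} FdControl;

/* Send descriptor fd over the Unix-domain socket sock.  Returns 0 on success. */
static int send_fd(int sock, int fd)
{
    char          byte = 'F';   /* at least one data byte must accompany it */
    struct iovec  iov = { .iov_base = &byte, .iov_len = 1 };
    FdControl     ctl;
    struct msghdr msg;

    assert(fd >= 0);
    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd.  Returns the new fd, or -1. */
static int recv_fd(int sock)
{
    char          byte;
    struct iovec  iov = { .iov_base = &byte, .iov_len = 1 };
    FdControl     ctl;
    struct msghdr msg;
    int           fd = -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
file description as the sender's.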

Also, we might port the whole backend to use threads, and then this
problem goes away. But I don't have time to write that patch this
week. :-)

It's possible that we might find that neither of the above approaches
are practical and that the performance benefits of resolving the
transaction from the original connection are large enough that we want
to try to make it work anyhow. However, I think we can postpone that
work to a future time. Any general solution to this problem at least
needs to be ABLE to resolve transactions at a later time from a
different session, so let's get that working first, and then see what
else we want to do.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#153

Masahiko Sawada

sawada.mshk@gmail.com

over 8 years ago

In reply to: Robert Haas (#152)

Re: Transactions involving multiple postgres foreign servers

On Sat, Sep 30, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think that having a resolver process keep a connection cache for each
foreign server for a while can reduce the overhead of connecting to
foreign servers. These connections will be invalidated by DDL. Also,
most of the time spent committing a distributed transaction goes to the
interaction between the coordinator and the foreign servers in the
two-phase commit protocol. So I guess the time spent signalling a
resolver process would not be a big overhead.

I agree. Also, in the future, we might try to allow connections to be
shared across backends. I did some research on this a number of years
ago and found that every operating system I investigated had some way
of passing a file descriptor from one process to another -- so a
shared connection cache might be possible.

That sounds like a good idea.

Also, we might port the whole backend to use threads, and then this
problem goes away. But I don't have time to write that patch this
week. :-)

It's possible that we might find that neither of the above approaches
are practical and that the performance benefits of resolving the
transaction from the original connection are large enough that we want
to try to make it work anyhow. However, I think we can postpone that
work to a future time. Any general solution to this problem at least
needs to be ABLE to resolve transactions at a later time from a
different session, so let's get that working first, and then see what
else we want to do.

Understood and agreed. I'll post the first version of the patch for the
new design to the next CF.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#154

Daniel Gustafsson

daniel@yesql.se

over 8 years ago

In reply to: Masahiko Sawada (#153)

Re: Transactions involving multiple postgres foreign servers

On 02 Oct 2017, at 08:31, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Sep 30, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think that having a resolver process keep a connection cache for each
foreign server for a while can reduce the overhead of connecting to
foreign servers. These connections will be invalidated by DDL. Also,
most of the time spent committing a distributed transaction goes to the
interaction between the coordinator and the foreign servers in the
two-phase commit protocol. So I guess the time spent signalling a
resolver process would not be a big overhead.

I agree. Also, in the future, we might try to allow connections to be
shared across backends. I did some research on this a number of years
ago and found that every operating system I investigated had some way
of passing a file descriptor from one process to another -- so a
shared connection cache might be possible.

That sounds like a good idea.

Also, we might port the whole backend to use threads, and then this
problem goes away. But I don't have time to write that patch this
week. :-)

It's possible that we might find that neither of the above approaches
are practical and that the performance benefits of resolving the
transaction from the original connection are large enough that we want
to try to make it work anyhow. However, I think we can postpone that
work to a future time. Any general solution to this problem at least
needs to be ABLE to resolve transactions at a later time from a
different session, so let's get that working first, and then see what
else we want to do.

Understood and agreed. I'll post the first version of the patch for the
new design to the next CF.

Closing this patch as "Returned with feedback" in this commitfest; looking
forward to a new version in an upcoming commitfest.

cheers ./daniel

#155

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 8 years ago

In reply to: Robert Haas (#152)

Re: Transactions involving multiple postgres foreign servers

On Fri, Sep 29, 2017 at 9:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:

It's possible that we might find that neither of the above approaches
are practical and that the performance benefits of resolving the
transaction from the original connection are large enough that we want
to try to make it work anyhow. However, I think we can postpone that
work to a future time. Any general solution to this problem at least
needs to be ABLE to resolve transactions at a later time from a
different session, so let's get that working first, and then see what
else we want to do.

+1.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#156

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Masahiko Sawada (#153)

3 attachment(s)

Re: Transactions involving multiple postgres foreign servers

On Mon, Oct 2, 2017 at 3:31 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Sat, Sep 30, 2017 at 12:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 27, 2017 at 11:15 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I think that having a resolver process keep a connection cache for each
foreign server for a while can reduce the overhead of connecting to
foreign servers. These connections will be invalidated by DDL. Also,
most of the time spent committing a distributed transaction goes to the
interaction between the coordinator and the foreign servers in the
two-phase commit protocol. So I guess the time spent signalling a
resolver process would not be a big overhead.

I agree. Also, in the future, we might try to allow connections to be
shared across backends. I did some research on this a number of years
ago and found that every operating system I investigated had some way
of passing a file descriptor from one process to another -- so a
shared connection cache might be possible.

That sounds like a good idea.

Also, we might port the whole backend to use threads, and then this
problem goes away. But I don't have time to write that patch this
week. :-)

It's possible that we might find that neither of the above approaches
are practical and that the performance benefits of resolving the
transaction from the original connection are large enough that we want
to try to make it work anyhow. However, I think we can postpone that
work to a future time. Any general solution to this problem at least
needs to be ABLE to resolve transactions at a later time from a
different session, so let's get that working first, and then see what
else we want to do.

Understood and agreed. I'll post the first version of the patch for the
new design to the next CF.

Attached is the latest version of the patch. I've changed it heavily
since the previous one. The parts I modified most are the resolution of
foreign transactions and the handling of dangling transactions. The
management of fdwxact entries is almost the same as in the previous patch.

Foreign Transaction Resolver
======================
I introduced a new background worker called the "foreign transaction
resolver", which is responsible for resolving the transactions prepared
on foreign servers. A foreign transaction resolver process is launched
by a backend process when it commits or rolls back a transaction, and
it periodically resolves the queued transactions on a database as long
as the queue is not empty. If the queue has been empty for the time
specified by the foreign_transaction_resolver_time GUC parameter, the
resolver exits. A backend doesn't launch a new resolver process if one
is already running; in that case, the backend just adds its entry to
the queue in shared memory and wakes the resolver up. The maximum
number of resolver processes we can launch is controlled by
max_foreign_transaction_resolvers, so we recommend setting
max_foreign_transaction_resolvers larger than the number of databases.
The resolver process also tries to resolve dangling transactions in
each cycle.
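
The launch and exit rules described above can be summarised as two
small decision functions. This is a sketch of one reading of the
description, not code from the patch: the names are hypothetical, the
timeout is assumed to be in milliseconds, and the message does not say
what a backend does when no resolver slot is free, so the sketch only
reports that case.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    ACT_WAKE_EXISTING,  /* a resolver is already running: just wake it up */
    ACT_LAUNCH_NEW,     /* none running and a slot is free: launch one */
    ACT_NO_SLOT         /* none running and all slots taken (unspecified) */
} BackendAction;

/* What a backend does after adding its entry to the shared queue. */
static BackendAction backend_action(bool resolver_running,
                                    int running_resolvers,
                                    int max_resolvers)
{
    assert(running_resolvers >= 0 && max_resolvers >= 0);

    if (resolver_running)
        return ACT_WAKE_EXISTING;
    if (running_resolvers < max_resolvers)
        return ACT_LAUNCH_NEW;
    return ACT_NO_SLOT;
}

/* A resolver exits once its queue has stayed empty for the whole timeout. */
static bool resolver_should_exit(bool queue_empty, long idle_ms, long timeout_ms)
{
    return queue_empty && idle_ms >= timeout_ms;
}
```

The ACT_NO_SLOT case is the reason for the recommendation about
max_foreign_transaction_resolvers: with at least one slot per database,
a backend should not end up there.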

Processing Sequence
=================
I've changed the processing sequence for resolving foreign transactions
so that the second phase of the two-phase commit protocol (COMMIT/ROLLBACK
PREPARED) is executed by a resolver process, not by the backend process.
The basic processing sequence is as follows:

* Backend process
1. In the pre-commit phase, the backend process saves fdwxact entries, and
then prepares the transaction on all foreign servers that can execute
the two-phase commit protocol.
2. Local commit.
3. Enqueue itself to the shmem queue and change its status to WAITING.
4. Launch or wake up a resolver process and wait.

* Resolver process
1. Dequeue the waiting process from the shmem queue.
2. Collect the fdwxact entries that are associated with the waiting process.
3. Resolve the foreign transactions.
4. Release the waiting process.

* Backend process (continued)
5. Wake up and continue.
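
The handshake between backend and resolver in the sequence above can be
mimicked in a single process. The following C sketch is only a model
under simplifying assumptions: the queue is an ordinary linked list
rather than shared memory, waking up is implicit, resolving a foreign
transaction is replaced by incrementing a counter, and all type and
function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ST_IDLE, ST_WAITING, ST_COMPLETED } WaitState;

/* Stand-in for a backend waiting for its foreign transactions. */
typedef struct Waiter {
    unsigned       xid;
    int            nforeign;   /* foreign transactions prepared at pre-commit */
    int            nresolved;  /* how many the resolver has finished */
    WaitState      state;
    struct Waiter *next;
} Waiter;

typedef struct {
    Waiter *head;
    Waiter *tail;
} WaitQueue;

/* Backend steps 3-4: enqueue itself and switch to WAITING. */
static void backend_enqueue(WaitQueue *q, Waiter *w)
{
    assert(w->state == ST_IDLE);

    w->state = ST_WAITING;
    w->next = NULL;
    if (q->tail != NULL)
        q->tail->next = w;
    else
        q->head = w;
    q->tail = w;
}

/*
 * Resolver steps 1-4 for one waiter.  Returns 1 if a waiter was
 * processed, 0 if the queue was empty.
 */
static int resolver_run_once(WaitQueue *q)
{
    Waiter *w = q->head;

    if (w == NULL)
        return 0;

    /* Step 1: dequeue. */
    q->head = w->next;
    if (q->head == NULL)
        q->tail = NULL;
    w->next = NULL;

    /* Steps 2-3: stand-in for COMMIT PREPARED on each foreign server. */
    while (w->nresolved < w->nforeign)
        w->nresolved++;

    /* Step 4: release the waiter; the backend would now wake up. */
    w->state = ST_COMPLETED;
    return 1;
}
```

The point of the ordering is that the local commit (step 2) happens
before the backend queues itself, so the resolver only ever sees
transactions whose local outcome is already decided.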

This is still in the design phase, and I'm sure there is room for
improvement and for more sensible behaviour, but I'd like to share the
current status of the patch. The patch includes regression tests but
does not yet include full documentation.

Feedback and comments are very welcome.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

0003-postgres_fdw-supports-atomic-commit-APIs_v13.patch (text/x-patch; charset=US-ASCII)

From 7751dff60b296bf6c360d16a11e34ebb967a60ae Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sat, 21 Oct 2017 16:06:54 +0900
Subject: [PATCH 3/3] postgres_fdw supports atomic commit APIs.

---
 contrib/postgres_fdw/connection.c              | 566 +++++++++++++++----------
 contrib/postgres_fdw/expected/postgres_fdw.out | 343 ++++++++++++++-
 contrib/postgres_fdw/option.c                  |   5 +-
 contrib/postgres_fdw/postgres_fdw.c            |  86 +++-
 contrib/postgres_fdw/postgres_fdw.h            |  14 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql      | 133 ++++++
 doc/src/sgml/postgres-fdw.sgml                 |  37 ++
 7 files changed, 945 insertions(+), 239 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index be4ec07..2db390b 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,9 +14,11 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -73,13 +75,13 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void disconnect_pg_server(ConnCacheEntry *entry);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
-static void pgfdw_xact_callback(XactEvent event, void *arg);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
@@ -91,24 +93,27 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
 						 bool ignore_errors);
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 						 PGresult **result);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry, bool is_commit);
+static ConnCacheEntry *GetConnectionCacheEntry(Oid umid);
 
-
-/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetExistingConnection(Oid umid)
 {
-	bool		found;
-	ConnCacheEntry *entry;
-	ConnCacheKey key;
+	ConnCacheEntry	*entry;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	Assert(entry->conn != NULL);
+
+	return entry->conn;
+}
+
+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
+{
+	ConnCacheEntry	*entry;
+	ConnCacheKey	key;
+	bool			found;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
@@ -128,7 +133,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
-		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
 									  pgfdw_inval_callback, (Datum) 0);
@@ -136,11 +140,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 									  pgfdw_inval_callback, (Datum) 0);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -155,6 +156,28 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->conn = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
+{
+	ConnCacheEntry *entry;
+
+	/* Get connection cache entry from cache */
+	entry = GetConnectionCacheEntry(user->umid);
+
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
@@ -198,7 +221,16 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 								  ObjectIdGetDatum(user->umid));
 
 		/* Now try to make the connection */
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -207,7 +239,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -217,9 +254,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails, and the caller can handle connection failure
+ * (connection_error_ok = true) return NULL, throw error otherwise.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -265,11 +305,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						 errmsg("could not connect to server \"%s\"",
+								server->servername),
+						 errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -414,15 +468,24 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer	*server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether two-phase commit
+		 * is possible for it.
+		 */
+		FdwXactRegisterForeignServer(serverid, userid,
+									 server_uses_two_phase_commit(server),
+									 false);
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -644,193 +707,6 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
- */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
-{
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
-
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
-
-	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
-	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
-	{
-		PGresult   *res;
-
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
-
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			bool		abort_cleanup_failure = false;
-
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
-
-			switch (event)
-			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-
-					/*
-					 * If abort cleanup previously failed for this connection,
-					 * we can't issue any more commands against it.
-					 */
-					pgfdw_reject_incomplete_xact_state_change(entry);
-
-					/* Commit all remote transactions during pre-commit */
-					entry->changing_xact_state = true;
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-					entry->changing_xact_state = false;
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-
-					/*
-					 * Don't try to clean up the connection if we're already
-					 * in error recursion trouble.
-					 */
-					if (in_error_recursion_trouble())
-						entry->changing_xact_state = true;
-
-					/*
-					 * If connection is already unsalvageable, don't touch it
-					 * further.
-					 */
-					if (entry->changing_xact_state)
-						break;
-
-					/*
-					 * Mark this connection as in the process of changing
-					 * transaction state.
-					 */
-					entry->changing_xact_state = true;
-
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE &&
-						!pgfdw_cancel_query(entry->conn))
-					{
-						/* Unable to cancel running query. */
-						abort_cleanup_failure = true;
-					}
-					else if (!pgfdw_exec_cleanup_query(entry->conn,
-													   "ABORT TRANSACTION",
-													   false))
-					{
-						/* Unable to abort remote transaction. */
-						abort_cleanup_failure = true;
-					}
-					else if (entry->have_prep_stmt && entry->have_error &&
-							 !pgfdw_exec_cleanup_query(entry->conn,
-													   "DEALLOCATE ALL",
-													   true))
-					{
-						/* Trouble clearing prepared statements. */
-						abort_cleanup_failure = true;
-					}
-					else
-					{
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-
-					/* Disarm changing_xact_state if it all worked. */
-					entry->changing_xact_state = abort_cleanup_failure;
-					break;
-			}
-		}
-
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
-
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-			entry->changing_xact_state)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			disconnect_pg_server(entry);
-		}
-	}
-
-	/*
-	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
-	 */
-	xact_got_connection = false;
-
-	/* Also reset cursor numbering for next transaction */
-	cursor_number = 0;
-}
-
-/*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
 static void
@@ -1193,3 +1069,255 @@ exit:	;
 		*result = last_res;
 	return timed_out;
 }
+
+/*
+ * Prepare the transaction on the foreign server. This function is called
+ * only in the pre-commit phase of the local transaction. Since we should
+ * already have a connection to the server in question, we look it up by
+ * umid (the key of the connection cache) and don't need serverid and
+ * userid to fetch the user mapping.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  const char *prep_id)
+{
+	ConnCacheEntry *entry = NULL;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	pgfdw_reject_incomplete_xact_state_change(entry);
+
+	if (entry->conn)
+	{
+		StringInfo	command;
+		bool		result;
+
+		pgfdw_reject_incomplete_xact_state_change(entry);
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_id);
+
+		entry->changing_xact_state = true;
+		result = pgfdw_exec_cleanup_query(entry->conn, command->data, false);
+		entry->changing_xact_state = false;
+
+		elog(DEBUG1, "prepare foreign transaction on server %u with ID %s",
+			 serverid, prep_id);
+
+		pgfdw_cleanup_after_transaction(entry, true);
+		return result;
+	}
+
+	return false;
+}
+
+/*
+ * Commit or abort the transaction on the foreign server. This function is
+ * called both in the pre-commit phase of the local transaction when
+ * committing and at the end of the local transaction when aborting.
+ * Since we should already have connections to the servers involved in the
+ * local transaction, we look the connection up by umid (the key of the
+ * connection cache) and don't need serverid and userid to fetch the user mapping.
+ */
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid,
+							  bool is_commit)
+{
+	ConnCacheEntry *entry = NULL;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	/*
+	 * If abort cleanup previously failed for this connection, we can't issue
+	 * any more commands against it.
+	 */
+	if (is_commit)
+		pgfdw_reject_incomplete_xact_state_change(entry);
+
+	if (entry->conn)
+	{
+		StringInfo	command;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION", is_commit ? "COMMIT" : "ROLLBACK");
+		entry->changing_xact_state = true;
+		result = pgfdw_exec_cleanup_query(entry->conn, command->data, false);
+		entry->changing_xact_state = false;
+
+		pgfdw_cleanup_after_transaction(entry, true);
+		return result;
+	}
+
+	return false;
+}
+
+/*
+ * Commit or abort a prepared transaction on the foreign server. This
+ * function can be called both at the end of the local transaction and in a
+ * new transaction, for example, by the resolver process.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit, const char *prep_id)
+{
+	ConnCacheEntry *entry;
+	PGconn			*conn;
+
+	/*
+	 * If we are in a valid transaction state, we are resolving the foreign
+	 * transaction from a newly started transaction and might not have a
+	 * connection to the server yet, so we use GetConnection, which establishes
+	 * the connection if we don't have one. This can happen when the foreign
+	 * transaction resolver process tries to resolve it. On the other hand, if
+	 * we are not in a valid transaction state, we are resolving a foreign
+	 * transaction at the end of the local transaction. A connection to the
+	 * server should already exist, so we just get its connection cache entry.
+	 */
+	if (IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, false);
+	else
+	{
+		entry = GetConnectionCacheEntry(umid);
+
+		/* Reject further use of connections which failed abort cleanup */
+		if (is_commit)
+			pgfdw_reject_incomplete_xact_state_change(entry);
+
+		conn = entry->conn;
+	}
+
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%s'",
+						 is_commit ? "COMMIT" : "ROLLBACK",
+						 prep_id);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+
+			if (diag_sqlstate)
+			{
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
+			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
+		}
+		else
+			result = true;
+
+		elog(DEBUG1, "%s prepared foreign transaction on server %u with ID %s",
+			 is_commit ? "committed" : "aborted", serverid, prep_id);
+
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+
+	return false;
+}
+
+/* Cleanup at main-transaction end */
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry, bool is_commit)
+{
+	if (entry->xact_depth > 0)
+	{
+		if (is_commit)
+		{
+			/*
+			 * If there were any errors in subtransactions, and we made prepared
+			 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+			 * prepared statements. This is annoying and not terribly bulletproof,
+			 * but it's probably not worth trying harder.
+			 *
+			 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+			 * old a server postgres_fdw can communicate with.	We intentionally
+			 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+			 * extent with older servers (leaking prepared statements as we go;
+			 * but we don't really support update operations pre-8.3 anyway).
+			 */
+			if (entry->have_prep_stmt && entry->have_error)
+			{
+				PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+				PQclear(res);
+			}
+
+			entry->have_prep_stmt = false;
+			entry->have_error = false;
+		}
+		else
+		{
+			/*
+			 * Don't try to clean up the connection if we're already in error
+			 * recursion trouble.
+			 */
+			if (in_error_recursion_trouble())
+				entry->changing_xact_state = true;
+
+			/* If connection is already unsalvageable, don't touch it further */
+
+			if (!entry->changing_xact_state)
+			{
+				entry->changing_xact_state = true;
+
+				if (entry->have_prep_stmt &&
+					!pgfdw_exec_cleanup_query(entry->conn, "DEALLOCATE ALL", true))
+				{
+					entry->changing_xact_state = true;
+				}
+			}
+		}
+		/* Reset state to show we're out of a transaction */
+		entry->xact_depth = 0;
+
+		/*
+		 * If the connection isn't in a good idle state, discard it to
+		 * recover. Next GetConnection will open a new connection.
+		 */
+		if (PQstatus(entry->conn) != CONNECTION_OK ||
+			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+		{
+			elog(DEBUG3, "discarding connection %p", entry->conn);
+			PQfinish(entry->conn);
+			entry->conn = NULL;
+		}
+	}
+
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4339bbf..3f9ded9 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,13 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -82,6 +94,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +137,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7_not_twophase (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8_twophase (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9_twophase (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -180,16 +202,19 @@ ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 \det+
-                                      List of foreign tables
- Schema |   Table    |  Server   |                   FDW options                    | Description 
---------+------------+-----------+--------------------------------------------------+-------------
- public | ft1        | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
- public | ft2        | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
- public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
- public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
- public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
- public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+                                         List of foreign tables
+ Schema |      Table       |  Server   |                   FDW options                    | Description 
+--------+------------------+-----------+--------------------------------------------------+-------------
+ public | ft1              | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
+ public | ft2              | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
+ public | ft4              | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
+ public | ft5              | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft6              | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7_not_twophase | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8_twophase     | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9_twophase     | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft_pg_type       | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
+(9 rows)
 
 -- Test that alteration of server options causes reconnection
 -- Remote's errors might be non-English, so hide them to ensure stable results
@@ -7466,3 +7491,301 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
 (4 rows)
 
 RESET enable_partition_wise_join;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+-- Check two_phase_commit setting
+SELECT srvname FROM pg_foreign_server WHERE 'two_phase_commit=on' = ANY(srvoptions) or 'two_phase_commit=off' = ANY(srvoptions);
+  srvname  
+-----------
+ loopback
+ loopback2
+ loopback3
+(3 rows)
+
+-- modify one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+(1 row)
+
+-- modify one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+(1 row)
+
+-- modify two supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(2);
+INSERT INTO ft9_twophase VALUES(2);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- modify two supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(3);
+INSERT INTO ft9_twophase VALUES(3);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- modify local and one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(4);
+INSERT INTO "S 1"."T 6" VALUES (4);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  4
+(1 row)
+
+-- modify local and one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(5);
+INSERT INTO "S 1"."T 6" VALUES (5);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  4
+(1 row)
+
+-- modify supported server and non-supported server and commit.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(6);
+INSERT INTO ft8_twophase VALUES(6);
+COMMIT;
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- modify supported server and non-supported server and rollback.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(7);
+INSERT INTO ft8_twophase VALUES(7);
+ROLLBACK;
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- modify foreign server and raise an error
+BEGIN;
+INSERT INTO ft8_twophase VALUES(8);
+INSERT INTO ft9_twophase VALUES(NULL); -- violation
+ERROR:  null value in column "c1" violates not-null constraint
+DETAIL:  Failing row contains (null).
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- commit and rollback foreign transactions that are part of
+-- prepare transaction.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(9);
+INSERT INTO ft9_twophase VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+BEGIN;
+INSERT INTO ft8_twophase VALUES(10);
+INSERT INTO ft9_twophase VALUES(10);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+-- fails, cannot prepare the transaction if non-supporeted
+-- server involved in.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(11);
+INSERT INTO ft8_twophase VALUES(11);
+PREPARE TRANSACTION 'gx1';
+ERROR:  can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 67e1c59..67e1127 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..fea5fe6 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdwxact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -348,6 +350,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+extern char *postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len);
 
 /*
  * Helper functions
@@ -420,7 +423,6 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
 				  const PgFdwRelationInfo *fpinfo_o,
 				  const PgFdwRelationInfo *fpinfo_i);
 
-
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
  * to my callback routines.
@@ -469,6 +471,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -476,6 +484,38 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 }
 
 /*
+ * postgresGetPrepareId
+ *
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
+ */
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
+{
+	/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	   *prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4ld_%u_%u",
+			 "px", Abs(random() * RANDOM_LARGE_MULTIPLIER),
+			 serverid, userid);
+
+	/* Report the identifier length, not counting the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
+
+/*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
@@ -1322,7 +1362,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1671,6 +1711,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	Oid			userid;
 	ForeignTable *table;
 	UserMapping *user;
+	ForeignServer *server;
 	AttrNumber	n_params;
 	Oid			typefnoid;
 	bool		isvarlena;
@@ -1698,9 +1739,15 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	user = GetUserMapping(userid, table->serverid);
+	server = GetForeignServer(user->serverid);
+
+	/* Remember this foreign server has been modified */
+	FdwXactRegisterForeignServer(user->serverid, user->userid,
+								 server_uses_two_phase_commit(server),
+								 true);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2303,7 +2350,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2564,7 +2611,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								&retrieved_attrs, NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3501,7 +3548,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3591,7 +3638,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3814,7 +3861,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
@@ -5173,3 +5220,26 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
 	/* We didn't find any suitable equivalence class expression */
 	return NULL;
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..856ddf5 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdwxact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -115,7 +116,9 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
+extern PGconn *GetExistingConnection(Oid umid);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -123,6 +126,14 @@ extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
 extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, const char *prep_id);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid,
+										  Oid umid, bool is_commit);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  const char *prep_id);
+
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
@@ -177,6 +188,7 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
 extern const char *get_jointype_name(JoinType jointype);
+extern bool server_uses_two_phase_commit(ForeignServer *server);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ddfec79..817f23d 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -88,6 +101,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +150,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7_not_twophase (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8_twophase (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9_twophase (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1817,3 +1844,109 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
 
 RESET enable_partition_wise_join;
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+-- Check two_phase_commit setting
+SELECT srvname FROM pg_foreign_server WHERE 'two_phase_commit=on' = ANY(srvoptions) or 'two_phase_commit=off' = ANY(srvoptions);
+
+-- modify one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+COMMIT;
+SELECT * FROM ft8_twophase;
+
+-- modify one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+
+-- modify two supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(2);
+INSERT INTO ft9_twophase VALUES(2);
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- modify two supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(3);
+INSERT INTO ft9_twophase VALUES(3);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- modify local and one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(4);
+INSERT INTO "S 1"."T 6" VALUES (4);
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM "S 1"."T 6";
+
+-- modify local and one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(5);
+INSERT INTO "S 1"."T 6" VALUES (5);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+SELECT * FROM "S 1"."T 6";
+
+-- modify supported server and non-supported server and commit.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(6);
+INSERT INTO ft8_twophase VALUES(6);
+COMMIT;
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
+
+-- modify supported server and non-supported server and rollback.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(7);
+INSERT INTO ft8_twophase VALUES(7);
+ROLLBACK;
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
+
+-- modify foreign server and raise an error
+BEGIN;
+INSERT INTO ft8_twophase VALUES(8);
+INSERT INTO ft9_twophase VALUES(NULL); -- violation
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- commit and rollback foreign transactions that are part of
+-- prepare transaction.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(9);
+INSERT INTO ft9_twophase VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+BEGIN;
+INSERT INTO ft8_twophase VALUES(10);
+INSERT INTO ft9_twophase VALUES(10);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- fails, cannot prepare the transaction if non-supporeted
+-- server involved in.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(11);
+INSERT INTO ft8_twophase VALUES(11);
+PREPARE TRANSACTION 'gx1';
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 265effb..280c158 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -436,6 +436,43 @@
    </para>
 
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted
+    independently. Some of these transactions may fail to commit while
+    others commit successfully. This behavior can be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> uses
+       two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions">
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </sect3>
  </sect2>
 
  <sect2>
-- 
1.8.3.1
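
The last regression test above expects PREPARE TRANSACTION to fail when a server without two-phase commit support takes part in the transaction. The rule behind that is all-or-nothing: the transaction can be prepared only if every participating foreign server supports two-phase commit. A minimal standalone sketch of that rule follows; the Participant struct and can_prepare() helper are hypothetical names for illustration and do not appear in the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct
{
	const char *servername;
	bool		two_phase_commit;	/* value of the server-level option */
} Participant;

/*
 * Returns true only if every participant supports two-phase commit.
 * An empty participant list is trivially preparable.
 */
bool
can_prepare(const Participant *parts, size_t nparts)
{
	bool		ready = true;
	size_t		i;

	/* One non-supporting server is enough to make the whole set unready. */
	for (i = 0; i < nparts; i++)
		ready = ready && parts[i].two_phase_commit;

	return ready;
}
```

With two supporting servers the helper returns true; adding a single non-supporting server flips the result to false, which corresponds to the failing case in the test.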

Attachment: 0002-Support-atomic-commit-involving-multiple-foreign-ser_v13.patch (text/x-patch; charset=US-ASCII)

From c87e302e17fb561a5d54ed40bc8962add7264b8a Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sat, 21 Oct 2017 16:05:35 +0900
Subject: [PATCH 2/3] Support atomic commit involving multiple foreign servers.

---
 doc/src/sgml/config.sgml                      |   84 +
 doc/src/sgml/fdwhandler.sgml                  |   87 +
 src/backend/access/rmgrdesc/Makefile          |    8 +-
 src/backend/access/rmgrdesc/fdwxactdesc.c     |   68 +
 src/backend/access/rmgrdesc/xlogdesc.c        |    6 +-
 src/backend/access/transam/Makefile           |    6 +-
 src/backend/access/transam/fdwxact.c          | 2418 +++++++++++++++++++++++++
 src/backend/access/transam/fdwxact_resolver.c |  532 ++++++
 src/backend/access/transam/rmgr.c             |    1 +
 src/backend/access/transam/twophase.c         |   42 +
 src/backend/access/transam/xact.c             |   34 +-
 src/backend/access/transam/xlog.c             |   19 +-
 src/backend/catalog/system_views.sql          |   12 +
 src/backend/commands/foreigncmds.c            |   20 +
 src/backend/postmaster/bgworker.c             |    4 +
 src/backend/postmaster/pgstat.c               |    6 +
 src/backend/postmaster/postmaster.c           |    5 +
 src/backend/replication/logical/decode.c      |    1 +
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/lmgr/lwlocknames.txt      |    2 +
 src/backend/storage/lmgr/proc.c               |    5 +
 src/backend/utils/misc/guc.c                  |   46 +
 src/backend/utils/misc/postgresql.conf.sample |    2 +
 src/backend/utils/probes.d                    |    2 +
 src/bin/initdb/initdb.c                       |    1 +
 src/bin/pg_controldata/pg_controldata.c       |    2 +
 src/bin/pg_resetwal/pg_resetwal.c             |    2 +
 src/bin/pg_waldump/rmgrdesc.c                 |    1 +
 src/include/access/fdwxact.h                  |  154 ++
 src/include/access/fdwxact_resolver.h         |   27 +
 src/include/access/resolver_private.h         |   60 +
 src/include/access/rmgrlist.h                 |    1 +
 src/include/access/twophase.h                 |    1 +
 src/include/access/xlog_internal.h            |    1 +
 src/include/catalog/pg_control.h              |    1 +
 src/include/catalog/pg_proc.h                 |   11 +
 src/include/foreign/fdwapi.h                  |   18 +
 src/include/pgstat.h                          |    4 +-
 src/include/storage/proc.h                    |   10 +
 src/test/recovery/Makefile                    |    2 +-
 src/test/recovery/t/014_fdwxact.pl            |  174 ++
 src/test/regress/expected/rules.out           |   13 +
 src/test/regress/pg_regress.c                 |   13 +-
 43 files changed, 3893 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/fdwxactdesc.c
 create mode 100644 src/backend/access/transam/fdwxact.c
 create mode 100644 src/backend/access/transam/fdwxact_resolver.c
 create mode 100644 src/include/access/fdwxact.h
 create mode 100644 src/include/access/fdwxact_resolver.h
 create mode 100644 src/include/access/resolver_private.h
 create mode 100644 src/test/recovery/t/014_fdwxact.pl
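
As a reading aid for PreCommit_FdwXacts() in fdwxact.c in this patch: once servers that cannot use two-phase commit (or were not modified) have been committed, the remaining decision depends only on how many two-phase-capable, modified servers are left and on whether local data was written. The sketch below restates that condition as a standalone function. The CommitStrategy enum and choose_commit_strategy() are hypothetical names used for illustration only; the patch expresses the same test inline with list_length(MyFdwConnections) and XactWriteLocalNode.

```c
#include <stdbool.h>

/* Hypothetical outcome labels; the patch itself has no such enum. */
typedef enum
{
	COMMIT_NOTHING,				/* no two-phase-capable server left */
	COMMIT_ONE_PHASE,			/* commit the single remaining server directly */
	COMMIT_TWO_PHASE			/* prepare first, resolve after local commit */
} CommitStrategy;

/*
 * n_2pc_modified: servers that support two-phase commit and were modified
 *                 in this transaction.
 * wrote_locally:  whether the transaction also changed local data.
 */
CommitStrategy
choose_commit_strategy(int n_2pc_modified, bool wrote_locally)
{
	/* Two or more servers, or one server plus local writes, need 2PC. */
	if (n_2pc_modified > 1 || (n_2pc_modified == 1 && wrote_locally))
		return COMMIT_TWO_PHASE;

	/* A single server with no local writes can be committed directly. */
	if (n_2pc_modified == 1)
		return COMMIT_ONE_PHASE;

	return COMMIT_NOTHING;
}
```

Note that the threshold is two or more servers, not more than two.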

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d360fc4..09bb543 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1451,6 +1451,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously. This parameter can only be set at server start.
+       </para>
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
@@ -3466,6 +3485,71 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
      </variablelist>
     </sect2>
 
+    <sect2 id="runtime-config-foregin-transaction-resolver">
+     <title>Foreign Transaction Resolvers</title>
+
+     <para>
+      These settings control the behavior of a foreign transaction resolver.
+     </para>
+
+     <variablelist>
+
+     <varlistentry id="guc-max-foreign-transaction-resolvers" xreflabel="max_foreign_transaction_resolvers">
+      <term><varname>max_foreign_transaction_resolvers</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_foreign_transaction_resolvers</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum number of foreign transaction resolution workers.
+       </para>
+       <para>
+        Foreign transaction resolution workers are taken from the pool defined by
+        <varname>max_worker_processes</varname>.
+       </para>
+       <para>
+        The default value is 0.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="foreign-transaction-resolution-interval" xreflabel="foreign_transaction_resolution_interval">
+      <term><varname>foreign_transaction_resolution_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>foreign_transaction_resolution_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies how long the foreign transaction resolver should wait when there
+        is no pending foreign transaction.
+       </para>
+       <para>
+        The default value is 10 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="foreign-transaction-resolver-timeout" xreflabel="foreign_transaction_resolver_timeout">
+      <term><varname>foreign_transaction_resolver_timeout</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>foreign_transaction_resolver_timeout</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum time to wait for foreign transaction resolution.
+       </para>
+       <para>
+        The default value is 60 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     </variablelist>
+    </sect2>
+
    </sect1>
 
    <sect1 id="runtime-config-query">
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 4250a03..f497b6e 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1757,4 +1757,91 @@ GetForeignServerByName(const char *name, bool missing_ok);
 
   </sect1>
 
+  <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</productname> transaction manager allows FDWs to read
+    and write data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts a
+    transaction on a foreign server as part of a <productname>PostgreSQL</productname>
+    transaction, it is required to register the foreign server, along with the
+    <productname>PostgreSQL</productname> user whose user mapping is used to connect
+    to it, using <function>FdwXactRegisterForeignServer</function>.
+<programlisting>
+void
+FdwXactRegisterForeignServer(Oid serverid,
+                             Oid userid,
+                             bool two_phase_compliant,
+                             bool modify)
+</programlisting>
+    <varname>two_phase_compliant</varname> should be true if the foreign server
+    supports the two-phase commit protocol, false otherwise. <varname>modify</varname>
+    should be true if the current transaction modifies data on the foreign
+    server.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    Here ft1 and ft2 are foreign tables on different foreign servers, which may
+    use different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</varname> is greater than zero,
+    <productname>PostgreSQL</productname> employs the two-phase commit protocol to
+    achieve an atomic distributed transaction. All the registered foreign servers
+    should support the two-phase commit protocol. The protocol is used when a
+    transaction involves two or more foreign servers that support two-phase
+    commit, or when it involves one such foreign server together with changes
+    to local data. In other cases, for example when only one
+    foreign server that supports two-phase commit is involved and no local data
+    is changed, the two-phase commit protocol is not used. With two-phase commit,
+    the commit is processed in two phases: a prepare phase and a commit phase.
+    In the prepare phase, <productname>PostgreSQL</productname> prepares the transactions
+    on all the foreign servers registered using
+    <function>FdwXactRegisterForeignServer</function>. If any foreign server fails
+    to prepare its transaction, the prepare phase fails. In the commit phase, all the
+    prepared transactions are committed if the prepare phase succeeded, or rolled
+    back if the prepare phase failed to prepare the transaction on every foreign
+    server.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</function> to get the prepared transaction
+    identifier for each foreign server involved. It stores this identifier along
+    with the serverid and userid for later use. It then calls
+    <function>PrepareForeignTransaction</function> with the same identifier.
+    </para>
+
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</function> with the same identifier and
+    action FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be retried
+    through the built-in <function>pg_resolve_foreign_xacts</function>, or by a foreign
+    transaction resolver process if one is running.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</varname> is zero, atomic
+    commit cannot be guaranteed across foreign servers. If the transaction on
+    <productname>PostgreSQL</productname> is committed, the distributed transaction
+    manager commits the transaction on each foreign server registered using
+    <function>FdwXactRegisterForeignServer</function>, independent of the outcome
+    of the same operation on other foreign servers. Thus transactions on some
+    foreign servers may be committed, while those on other foreign servers
+    are rolled back. If the transaction on <productname>PostgreSQL</productname>
+    aborts, the transactions on all the foreign servers are aborted too.
+    </para>
+  </sect1>
  </chapter>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..742e825 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,9 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdwxactdesc.o \
+	genericdesc.o  gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdwxactdesc.c b/src/backend/access/rmgrdesc/fdwxactdesc.c
new file mode 100644
index 0000000..b262645
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdwxactdesc.c
@@ -0,0 +1,68 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdwxactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdwxact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FdwXactOnDiskData *fdw_insert_xlog = (FdwXactOnDiskData *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+
+		/*
+		 * TODO: we also need to assess whether we want to add this
+		 * information
+		 */
+		appendStringInfo(buf, " foreign transaction info: %s",
+						 fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f72f076..d5ce90d 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,16 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s "
+						 "max_prepared_foreign_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..90d0056 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,9 +12,9 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
-	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
-	xact.o xlog.o xlogarchive.o xlogfuncs.o \
+OBJS = clog.o commit_ts.o fdwxact.o fdwxact_resolver.o generic_xlog.o multixact.o \
+	parallel.o rmgr.o slru.o subtrans.o timeline.o transam.o twophase.o \
+	twophase_rmgr.o varsup.o xact.o xlog.o xlogarchive.o xlogfuncs.o \
 	xloginsert.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/fdwxact.c b/src/backend/access/transam/fdwxact.c
new file mode 100644
index 0000000..fd6787b
--- /dev/null
+++ b/src/backend/access/transam/fdwxact.c
@@ -0,0 +1,2418 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdwxact.c
+ *
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using FdwXactRegisterForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases. In the first phase, executed during
+ * the pre-commit phase, transactions are prepared on all the foreign servers
+ * that can participate in the two-phase commit protocol. Transactions on other
+ * foreign servers are committed in the same phase. In the second phase, if
+ * the first phase did not succeed for whatever reason, the foreign servers
+ * are asked to roll back their prepared transactions, or to abort the
+ * transactions if they are not prepared. This is done by the backend
+ * process that executed the first phase. If the first phase succeeds, the
+ * backend process adds itself to the queue in shared memory and then
+ * asks the foreign transaction resolver process to resolve the foreign
+ * transactions associated with its transaction. Once the resolver process
+ * has resolved all of those foreign transactions, the backend wakes up
+ * and resumes processing.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved (aka a dangling transaction).
+ * During the first phase, before actually preparing the transactions, enough
+ * information is persisted to disk and logs to resolve such transactions.
+ *
+ * During WAL replay and replication, FdwXactCtl also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdwxact records happens by the following rules:
+ *
+ * 	* On PREPARE redo we add the foreign transaction to FdwXactCtl->fdw_xacts.
+ *	  We set fdw_xact->inredo to true for such entries.
+ *	* On Checkpoint redo we iterate through FdwXactCtl->fdw_xacts entries
+ *	  that have fdw_xact->inredo set and are behind the redo_horizon.
+ *	  We save them to disk and also set fdw_xact->ondisk to true.
+ *	* On COMMIT and ABORT we delete the entry from FdwXactCtl->fdw_xacts.
+ *	  If fdw_xact->ondisk is true, we delete the corresponding entry from
+ *	  the disk as well.
+ *	* RecoverPreparedTransactions() and StandbyRecoverPreparedTransactions()
+ *	  have been modified to go through fdw_xact->inredo entries that have
+ *	  not made it to disk yet.
+ *-------------------------------------------------------------------------
+ */
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/htup_details.h"
+#include "access/twophase.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/ps_status.h"
+#include "utils/snapmgr.h"
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid			serverid;
+	Oid			userid;
+	Oid			umid;
+	char	   *servername;
+	FdwXact		fdw_xact;		/* foreign prepared transaction entry in case
+								 * prepared */
+	bool		two_phase_commit;		/* Should use two phase commit
+										 * protocol while committing
+										 * transaction on this server,
+										 * whenever necessary. */
+	bool		modified;		/* modified on foreign server in the transaction */
+	GetPrepareId_function get_prepare_id;
+	EndForeignTransaction_function end_foreign_xact;
+	PrepareForeignTransaction_function prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
+}	FdwConnection;
+
+/* List of foreign connections participating in the transaction */
+List	   *MyFdwConnections = NIL;
+
+/* Shmem hash entry */
+typedef struct
+{
+	/* tag */
+	TransactionId	xid;
+
+	/* data */
+	FdwXact	first_entry;
+} FdwXactHashEntry;
+
+static HTAB	*FdwXactHash;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool		TwoPhaseReady = true;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 hex digits of xid, 8 of the
+ * foreign server oid and 8 of the user oid, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FdwXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+			 serverid, userid)
+
+/*
+ * If no backend locks it and the local transaction is not in progress
+ * we can regard it as a dangling transaction.
+ */
+#define IsDanglingFdwXact(fx) \
+	(((FdwXact) (fx))->locking_backend == InvalidBackendId && \
+		 !TransactionIdIsInProgress(((FdwXact)(fx))->local_xid))
+
+static FdwXact FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_info);
+static void FdwXactPrepareForeignTransactions(void);
+static void AtProcExit_FdwXact(int code, Datum arg);
+static bool FdwXactResolveForeignTransaction(FdwXact fdw_xact,
+											 ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static void UnlockFdwXact(FdwXact fdw_xact);
+static void UnlockMyFdwXacts(void);
+static void remove_fdw_xact(FdwXact fdw_xact);
+static FdwXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, char *fdw_xact_id);
+static int	GetFdwXactList(FdwXact * fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FdwXact fdw_xact);
+static FdwXactOnDiskData *ReadFdwXactFile(TransactionId xid, Oid serverid,
+				Oid userid);
+static void RemoveFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+				  bool giveWarning);
+static void RecreateFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len);
+static void XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len);
+static FdwXact get_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
+static bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+							List **qualifying_xacts);
+
+static void FdwXactQueueInsert(void);
+static void FdwXactCancelWait(void);
+
+/*
+ * Maximum number of foreign prepared transaction entries at any given time
+ * GUC variable, change requires restart.
+ */
+int			max_prepared_foreign_xacts = 0;
+
+int			max_foreign_xact_resolvers = 0;
+
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* foreign transaction entries locked by this backend */
+List	   *MyLockedFdwXacts = NIL;
+FdwXactResolver *MyFdwXactResolver = NULL;
+
+/* Record the server, userid participating in the transaction. */
+void
+FdwXactRegisterForeignServer(Oid serverid, Oid userid, bool two_phase_commit,
+							 bool modify)
+{
+	FdwConnection *fdw_conn;
+	ListCell   *lcell;
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	UserMapping *user_mapping;
+	FdwRoutine *fdw_routine;
+	MemoryContext old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Quick return if the entry already exists */
+	foreach(lcell, MyFdwConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		/* Quick return if there is already registered connection */
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+		{
+			fdw_conn->modified |= modify;
+			return;
+		}
+	}
+
+	/*
+	 * This list and its contents need to be allocated in the top transaction
+	 * memory context
+	 */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FdwConnection *) palloc(sizeof(FdwConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		ereport(ERROR,
+				(errmsg("no function to end a foreign transaction provided for FDW %s",
+						fdw->fdwname)));
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (max_foreign_xact_resolvers == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_foreign_transaction_resolvers to a nonzero value.")));
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for preparing foreign transaction for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for resolving prepared foreign transaction for FDW %s",
+							fdw->fdwname)));
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available, so save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->modified = modify;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFdwConnections = lappend(MyFdwConnections, fdw_conn);
+	/* Restore the previous memory context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/*
+ * FdwXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+Size
+FdwXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FdwXactCtlData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FdwXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FdwXactData)));
+
+	size = MAXALIGN(size);
+	size = add_size(size, hash_estimate_size(max_prepared_foreign_xacts,
+											 sizeof(FdwXactHashEntry)));
+
+	return size;
+}
+
+/*
+ * FdwXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FdwXactCtlData structure.
+ */
+void
+FdwXactShmemInit(void)
+{
+	bool		found;
+
+	FdwXactCtl = ShmemInitStruct("Foreign transactions table",
+								 FdwXactShmemSize(),
+								 &found);
+	if (!IsUnderPostmaster)
+	{
+		FdwXact		fdw_xacts;
+		HASHCTL		info;
+		long		init_hash_size;
+		long		max_hash_size;
+		int			cnt;
+
+		Assert(!found);
+		FdwXactCtl->freeFdwXacts = NULL;
+		FdwXactCtl->numFdwXacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FdwXact)
+			((char *) FdwXactCtl +
+			 MAXALIGN(offsetof(FdwXactCtlData, fdw_xacts) +
+					  sizeof(FdwXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_free_next = FdwXactCtl->freeFdwXacts;
+			FdwXactCtl->freeFdwXacts = &fdw_xacts[cnt];
+		}
+
+		MemSet(&info, 0, sizeof(info));
+		info.keysize = sizeof(TransactionId);
+		info.entrysize = sizeof(FdwXactHashEntry);
+
+		max_hash_size = max_prepared_foreign_xacts;
+		init_hash_size = max_hash_size / 2;
+
+		FdwXactHash = ShmemInitHash("FdwXact hash",
+									init_hash_size,
+									max_hash_size,
+									&info,
+									HASH_ELEM | HASH_BLOBS);
+	}
+	else
+	{
+		Assert(FdwXactCtl);
+		Assert(found);
+	}
+}
+
+
+/*
+ * PreCommit_FdwXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Normally the foreign transactions are prepared on the foreign servers that
+ * can execute the two-phase commit protocol. However, when only one such
+ * server is involved in the transaction and no changes are made to local
+ * data, two-phase commit is unnecessary, so we simply try to commit the
+ * transaction on that server. Prepared transactions are committed or aborted
+ * after the current transaction has been committed or aborted, respectively.
+ * We try to commit transactions on the rest of the foreign servers now. For
+ * these foreign servers it is possible that some transactions commit even if
+ * the local transaction aborts.
+ */
+void
+PreCommit_FdwXacts(void)
+{
+	ListCell   *cur;
+	ListCell   *prev;
+	ListCell   *next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFdwConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not
+	 * execute two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFdwConnections), prev = NULL; cur; cur = next)
+	{
+		FdwConnection *fdw_conn = lfirst(cur);
+
+		next = lnext(cur);
+
+		/*
+		 * In the pre-commit phase we commit the foreign transactions on
+		 * servers that either cannot execute the two-phase commit protocol
+		 * or were not modified by this transaction.
+		 */
+		if (!fdw_conn->two_phase_commit || !fdw_conn->modified)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns a failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+					 fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFdwConnections = list_delete_cell(MyFdwConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * At this point the foreign servers that cannot execute two-phase-commit
+	 * protocol have already committed, and MyFdwConnections contains only
+	 * foreign servers that can execute it. We don't need the
+	 * two-phase-commit protocol if only one such foreign server remains and
+	 * the transaction did not write on the local node.
+	 */
+	if ((list_length(MyFdwConnections) > 1) ||
+		(list_length(MyFdwConnections) == 1 && XactWriteLocalNode))
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		FdwXactPrepareForeignTransactions();
+	}
+	else if (list_length(MyFdwConnections) == 1)
+	{
+		FdwConnection *fdw_conn = lfirst(list_head(MyFdwConnections));
+
+		/*
+		 * We don't need to use the two-phase commit protocol when only one
+		 * server remains, even if this server can execute two-phase-commit
+		 * protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* Only this connection remained, so MyFdwConnections is now empty */
+		MyFdwConnections = list_delete_ptr(MyFdwConnections, fdw_conn);
+	}
+}
+
+/*
+ * FdwXactPrepareForeignTransactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+FdwXactPrepareForeignTransactions(void)
+{
+	ListCell   *lcell;
+	FdwXact		prev_fdwxact = NULL;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFdwConnections)
+	{
+		FdwConnection *fdw_conn = (FdwConnection *) lfirst(lcell);
+		char	    *fdw_xact_id;
+		int			fdw_xact_id_len;
+		FdwXact		fdw_xact;
+
+		if (!fdw_conn->two_phase_commit || !fdw_conn->modified)
+			continue;
+
+
+		/* Generate prepare transaction id for foreign server */
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+											   fdw_conn->userid,
+											   &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to
+		 * prepare it on the foreign server. Registration persists this
+		 * information to disk and to WAL (thereby relaying it to standbys).
+		 * Thus, in case we lose connectivity to the foreign server or crash
+		 * ourselves, we will remember that we have prepared a transaction on
+		 * the foreign server and try to resolve it when connectivity is
+		 * restored or after crash recovery.
+		 *
+		 * If we crash after persisting the information but before preparing
+		 * the transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long
+		 * as the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before
+		 * persisting the information to the disk and crash in-between these
+		 * two steps, we will forget that we prepared the transaction on the
+		 * foreign server and will not be able to resolve it after the crash.
+		 * Hence persist first then prepare.
+		 */
+		fdw_xact = FdwXactRegisterFdwXact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id);
+
+		/*
+		 * Between FdwXactRegisterFdwXact call above till this backend hears back
+		 * from foreign server, the backend may abort the local transaction
+		 * (say, because of a signal). During abort processing, it will send
+		 * an ABORT message to the foreign server. If the foreign server has
+		 * not prepared the transaction, the message will succeed. If the
+		 * foreign server has prepared transaction, it will throw an error,
+		 * which we will ignore and the prepared foreign transaction will be
+		 * resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id))
+		{
+			StringInfo servername;
+			/*
+			 * An error occurred, and we didn't prepare the transaction.
+			 * Delete the entry from foreign transaction table. Raise an
+			 * error, so that the local server knows that one of the foreign
+			 * servers has failed to prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then
+			 * we raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell pointer,
+			 * but that is fine since we will not use that pointer because the
+			 * subsequent ereport will get us out of this loop.
+			 */
+			servername = makeStringInfo();
+			appendStringInfoString(servername, fdw_conn->servername);
+			MyFdwConnections = list_delete_ptr(MyFdwConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							servername->data)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+
+		/*
+		 * If this is the first fdwxact entry we keep it in the hash table for
+		 * the later use.
+		 */
+		if (!prev_fdwxact)
+		{
+			FdwXactHashEntry	*fdwxact_entry;
+			bool				found;
+			TransactionId		key;
+
+			key = fdw_xact->local_xid;
+
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			fdwxact_entry = (FdwXactHashEntry *) hash_search(FdwXactHash,
+															 &key,
+															 HASH_ENTER, &found);
+
+			Assert(!found);
+			fdwxact_entry->first_entry = fdw_xact;
+			LWLockRelease(FdwXactLock);
+		}
+		else
+		{
+			/*
+			 * Make a list of fdwxacts that are associated with the
+			 * same local transaction.
+			 */
+			Assert(fdw_xact->fx_next == NULL);
+			prev_fdwxact->fx_next = fdw_xact;
+		}
+
+		prev_fdwxact = fdw_xact;
+	}
+
+	return;
+}
+
+/*
+ * FdwXactRegisterFdwXact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function writes the entry to
+ * WAL; it is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FdwXact
+FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+					   Oid umid, char *fdw_xact_id)
+{
+	FdwXact		fdw_xact;
+	FdwXactOnDiskData *fdw_xact_file_data;
+	MemoryContext	old_context;
+	int			data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid, fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->locking_backend = MyBackendId;
+
+	LWLockRelease(FdwXactLock);
+
+	/* Remember that we have locked this entry. */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	MyLockedFdwXacts = lappend(MyLockedFdwXacts, fdw_xact);
+	MemoryContextSwitchTo(old_context);
+
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FdwXactOnDiskData, fdw_xact_id);
+	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FdwXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+		   FDW_XACT_ID_LEN);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *) fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* File is written completely, checkpoint can proceed with syncing */
+	fdw_xact->valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. Caller must hold
+ * FdwXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FdwXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				char *fdw_xact_id)
+{
+	int i;
+	FdwXact fdw_xact;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FdwXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	/* Check for a duplicate foreign transaction entry */
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		fdw_xact = FdwXactCtl->fdw_xacts[i];
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+				 xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FdwXactCtl->freeFdwXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions : \"%d\".",
+						 max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FdwXactCtl->freeFdwXacts;
+	FdwXactCtl->freeFdwXacts = fdw_xact->fx_free_next;
+
+	/* Insert the entry to active array */
+	Assert(FdwXactCtl->numFdwXacts < max_prepared_foreign_xacts);
+	FdwXactCtl->fdw_xacts[FdwXactCtl->numFdwXacts++] = fdw_xact;
+
+	/* Initialize the entry; the caller sets locking_backend under the LWLock */
+	fdw_xact->locking_backend = InvalidBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->inredo = false;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory, disk and
+ * logs about the removal in WAL.
+ */
+static void
+remove_fdw_xact(FdwXact fdw_xact)
+{
+	int			cnt;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		if (FdwXactCtl->fdw_xacts[cnt] == fdw_xact)
+		{
+			/* Remove the entry from active array */
+			FdwXactCtl->numFdwXacts--;
+			FdwXactCtl->fdw_xacts[cnt] = FdwXactCtl->fdw_xacts[FdwXactCtl->numFdwXacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_free_next = FdwXactCtl->freeFdwXacts;
+			FdwXactCtl->freeFdwXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			fdw_xact->fx_next = NULL;
+			MyLockedFdwXacts = list_delete_ptr(MyLockedFdwXacts, fdw_xact);
+
+			LWLockRelease(FdwXactLock);
+
+			if (!RecoveryInProgress())
+			{
+				FdwRemoveXlogRec fdw_remove_xlog;
+				XLogRecPtr	recptr;
+
+				/* Fill up the log record before releasing the entry */
+				fdw_remove_xlog.serverid = fdw_xact->serverid;
+				fdw_remove_xlog.dbid = fdw_xact->dboid;
+				fdw_remove_xlog.xid = fdw_xact->local_xid;
+				fdw_remove_xlog.userid = fdw_xact->userid;
+
+				START_CRIT_SECTION();
+
+				/*
+				 * Log that we are removing the foreign transaction entry and
+				 * remove the file from the disk as well.
+				 */
+				XLogBeginInsert();
+				XLogRegisterData((char *) &fdw_remove_xlog, sizeof(fdw_remove_xlog));
+				recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+				XLogFlush(recptr);
+
+				END_CRIT_SECTION();
+			}
+
+			/* Remove the file from the disk if exists. */
+			if (fdw_xact->ondisk)
+				RemoveFdwXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								  fdw_xact->userid, true);
+			return;
+		}
+	}
+	LWLockRelease(FdwXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find %p in FdwXactCtl array", fdw_xact);
+}
+
+bool
+TwoPhaseCommitRequired(void)
+{
+	if ((list_length(MyFdwConnections) > 1) ||
+		(list_length(MyFdwConnections) == 1 && XactWriteLocalNode))
+		return true;
+
+	return false;
+}
+
+/*
+ * UnlockFdwXact
+ *
+ * Unlock the foreign transaction entry by wiping out the locking_backend and
+ * removing it from the backend's list of foreign transactions.
+ */
+static void
+UnlockFdwXact(FdwXact fdw_xact)
+{
+	/* Only the backend holding the lock is allowed to unlock */
+	Assert(fdw_xact->locking_backend == MyBackendId);
+
+	/*
+	 * First set the locking backend as invalid, and then remove it from the
+	 * list of locked foreign transactions, under the LW lock. If we reverse
+	 * the order and the process exits in between those two, we will be left
+	 * with an entry locked by this backend, which gets unlocked only at
+	 * server restart.
+	 */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact->locking_backend = InvalidBackendId;
+	MyLockedFdwXacts = list_delete_ptr(MyLockedFdwXacts, fdw_xact);
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * UnlockMyFdwXacts
+ *
+ * Unlock the foreign transaction entries locked by this backend.
+ */
+static void
+UnlockMyFdwXacts(void)
+{
+	ListCell *cell;
+	ListCell *next;
+
+	for (cell = list_head(MyLockedFdwXacts); cell != NULL; cell = next)
+	{
+		FdwXact	fdwxact = (FdwXact) lfirst(cell);
+
+		next = lnext(cell);
+
+		/*
+		 * An FdwXact entry pointed to by MyLockedFdwXacts may already be in
+		 * use by another backend: the resolver process can remove the entry
+		 * and another backend can then reuse it before we unlock it. So we
+		 * unlock only the FdwXact entries that are still locked by
+		 * MyBackendId.
+		 */
+		if (fdwxact->locking_backend == MyBackendId)
+			UnlockFdwXact(fdwxact);
+	}
+}
+
+/*
+ * AtProcExit_FdwXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FdwXact(int code, Datum arg)
+{
+	UnlockMyFdwXacts();
+}
+
+/*
+ * Wait for foreign transaction resolution, if requested by user.
+ *
+ * Initially backends start in state FDW_XACT_NOT_WAITING and then
+ * change that state to FDW_XACT_WAITING before adding ourselves
+ * to the wait queue. During FdwXactResolveForeignTransactions a fdwxact
+ * resolver changes the state to FDW_XACT_WAIT_COMPLETE once foreign
+ * transactions are resolved. This backend then resets its state
+ * to FDW_XACT_NOT_WAITING. If fdwxact_list is NULL, it means that
+ * we use the list of FdwXact just used, so set it to MyLockedFdwXacts.
+ *
+ * This function is inspired by SyncRepWaitForLSN.
+ */
+void
+FdwXactWaitForResolve(TransactionId wait_xid, bool is_commit)
+{
+	char		*new_status = NULL;
+	const char	*old_status;
+	ListCell	*cell;
+	List		*entries_to_resolve;
+
+	/*
+	 * Quick exit if user has not requested foreign transaction resolution
+	 * or there are no foreign servers that are modified in the current
+	 * transaction.
+	 */
+	if (!FdwXactEnabled())
+		return;
+
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	Assert(FdwXactCtl != NULL);
+	Assert(TransactionIdIsValid(wait_xid));
+
+	Assert(MyProc->fdwXactState == FDW_XACT_NOT_WAITING);
+
+	/*
+	 * Get the list of foreign transactions that are involved with the
+	 * given wait_xid.
+	 */
+	search_fdw_xact(wait_xid, MyDatabaseId, InvalidOid, InvalidOid,
+					&entries_to_resolve);
+
+	/* Quick exit if we found no foreign transaction that we need to resolve */
+	if (list_length(entries_to_resolve) <= 0)
+		return;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Change status of fdw_xact entries according to is_commit */
+	foreach (cell, entries_to_resolve)
+	{
+		FdwXact fdw_xact = (FdwXact) lfirst(cell);
+
+		/* Don't overwrite status if fate is determined */
+		if (fdw_xact->status == FDW_XACT_PREPARING)
+			fdw_xact->status = (is_commit ?
+								FDW_XACT_COMMITTING_PREPARED :
+								FDW_XACT_ABORTING_PREPARED);
+	}
+
+	/* Set backend status and enqueue ourselves */
+	MyProc->fdwXactState = FDW_XACT_WAITING;
+	MyProc->waitXid = wait_xid;
+	FdwXactQueueInsert();
+	LWLockRelease(FdwXactLock);
+
+	/* Launch a resolver process if none is running yet, then wake it up */
+	fdwxact_maybe_launch_resolver();
+
+	/*
+	 * Alter ps display to show waiting for foreign transaction
+	 * resolution.
+	 */
+	if (update_process_title)
+	{
+		int len;
+
+		old_status = get_ps_display(&len);
+		new_status = (char *) palloc(len + 31 + 1);
+		memcpy(new_status, old_status, len);
+		sprintf(new_status + len, " waiting for resolve %u", wait_xid);
+		set_ps_display(new_status, false);
+		new_status[len] = '\0';	/* truncate off "waiting ..." */
+	}
+
+	/* Wait for all foreign transactions to be resolved */
+	for (;;)
+	{
+		/* Must reset the latch before testing state */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Acquiring the lock is not needed, the latch ensures proper
+		 * barriers. If it looks like we're done, we must really be done,
+		 * because once the resolver changes the state to
+		 * FDW_XACT_WAIT_COMPLETE, it will never update it again, so we can't
+		 * be seeing a stale value in that case.
+		 */
+		if (MyProc->fdwXactState == FDW_XACT_WAIT_COMPLETE)
+			break;
+
+		/*
+		 * If a termination request arrives, cancel the wait with a warning.
+		 */
+		if (ProcDiePending)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("canceling the wait for resolving foreign transaction and terminating connection due to administrator command"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If a query cancel interrupt arrives we just terminate the wait with
+		 * a suitable warning. The foreign transactions can be orphaned but
+		 * the foreign xact resolver can pick them up and try to resolve them
+		 * later.
+		 */
+		if (QueryCancelPending)
+		{
+			QueryCancelPending = false;
+			ereport(WARNING,
+					(errmsg("canceling wait for resolving foreign transaction due to user request"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If the postmaster dies, we'll probably never get an
+		 * acknowledgement, because all the resolver processes will exit. So
+		 * just bail out.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * Wait on latch.  Any condition that should wake us up will set the
+		 * latch, so no need for timeout.
+		 */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+				  WAIT_EVENT_FDW_XACT_RESOLUTION);
+	}
+
+	pg_read_barrier();
+
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+
+	/*
+	 * Unlock the list of locked entries. This also means that the entries
+	 * that could not be resolved remain as dangling transactions.
+	 */
+	UnlockMyFdwXacts();
+	MyLockedFdwXacts = NIL;
+
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+}
+
+static void
+FdwXactCancelWait(void)
+{
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+		SHMQueueDelete(&(MyProc->fdwXactLinks));
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+	LWLockRelease(FdwXactLock);
+}
+
+static void
+FdwXactQueueInsert(void)
+{
+	SHMQueueInsertBefore(&(FdwXactRslvCtl->FdwXactQueue),
+						 &(MyProc->fdwXactLinks));
+}
+
+/*
+ * Resolve the foreign transactions in the given dbid that are associated with
+ * the same local transaction, and release the waiter once all of those
+ * foreign transactions have been resolved.
+ */
+bool
+FdwXactResolveForeignTransactions(Oid dbid)
+{
+	TransactionId		key;
+	volatile FdwXact	fdwxact = NULL;
+	volatile FdwXact	fx_next;
+	FdwXactHashEntry	*fdwxact_entry;
+	bool	found;
+	PGPROC	*proc;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Fetch the first proc in the queue that belongs to the given database */
+	proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->FdwXactQueue),
+								   &(FdwXactRslvCtl->FdwXactQueue),
+								   offsetof(PGPROC, fdwXactLinks));
+	while (proc != NULL && proc->databaseId != dbid)
+	{
+		/* Advance from the current element; otherwise we would loop forever */
+		proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->FdwXactQueue),
+									   &(proc->fdwXactLinks),
+									   offsetof(PGPROC, fdwXactLinks));
+	}
+
+	/* Return if there is no matching entry in the queue */
+	if (!proc)
+	{
+		LWLockRelease(FdwXactLock);
+		return false;
+	}
+
+	/* Search fdwxact entry from shmem hash by local transaction id */
+	key = proc->waitXid;
+	fdwxact_entry = (FdwXactHashEntry *) hash_search(FdwXactHash,
+													 (void *) &key,
+													 HASH_FIND, &found);
+
+	/*
+	 * After recovery, there might not be prepared foreign transaction
+	 * entries in the hash map on shared memory. If we could not find the
+	 * entry we next scan over FdwXactCtl->fdw_xacts array.
+	 */
+	if (!found)
+	{
+		int i;
+		FdwXact prev_fx = NULL;
+
+		found = false;
+		for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+		{
+			FdwXact fx = FdwXactCtl->fdw_xacts[i];
+
+			if (fx->dboid == dbid && fx->local_xid == proc->waitXid)
+			{
+				found = true;
+
+				/* Save first entry of the list */
+				if (fdwxact == NULL)
+					fdwxact = fx;
+
+				/* Link from previous entry */
+				if (prev_fx)
+					prev_fx->fx_next = fx;
+
+				prev_fx = fx;
+			}
+		}
+
+		LWLockRelease(FdwXactLock);
+
+		if (!found)
+			ereport(ERROR,
+					(errmsg("foreign transaction for local transaction id \"%u\" does not exist",
+							proc->waitXid)));
+	}
+	else
+	{
+		LWLockRelease(FdwXactLock);
+		fdwxact = fdwxact_entry->first_entry;
+	}
+
+	/* Resolve all foreign transactions associated with proc->waitXid */
+	while (fdwxact)
+	{
+		/*
+		 * Remember the next entry to resolve since current entry
+		 * could be removed after resolved.
+		 */
+		fx_next = fdwxact->fx_next;
+
+		if (!FdwXactResolveForeignTransaction(fdwxact, get_prepared_foreign_xact_resolver(fdwxact)))
+		{
+			/*
+			 * If we failed to resolve, leave all the remaining entries.
+			 * Because we did not remove this proc's entry from the shmem
+			 * hash table, we will try to resolve again later. We must not
+			 * release the waiter until all of its foreign transactions
+			 * have been resolved.
+			 *
+			 * XXX : We might have to try to resolve as many of the
+			 * remaining transactions as possible.
+			 * XXX : If the resolution failed due to e.g. a network problem,
+			 * we might get into a loop.
+			 * XXX : If the resolution failed because the prepared transaction
+			 * doesn't exist on the foreign server, we should regard that as
+			 * if we had succeeded in resolving the transaction.
+			 */
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			/* No hash entry exists if the list came from the fdw_xacts array */
+			if (fdwxact_entry)
+				fdwxact_entry->first_entry = fdwxact;
+			LWLockRelease(FdwXactLock);
+			return false;
+		}
+
+		fdwxact = fx_next;
+	}
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Remove the entry for this transaction from the shmem hash table as well */
+	hash_search(FdwXactHash, (void *) &key, HASH_REMOVE, NULL);
+
+	/* Remove proc from queue */
+	SHMQueueDelete(&(proc->fdwXactLinks));
+
+	pg_write_barrier();
+
+	/* Set state to complete */
+	proc->fdwXactState = FDW_XACT_WAIT_COMPLETE;
+
+	/* Wake up the waiter only when we have set state and removed from queue */
+	SetLatch(&(proc->procLatch));
+	LWLockRelease(FdwXactLock);
+
+	return true;
+}
+
+bool
+FdwXactResolveDanglingTransactions(Oid dbid)
+{
+	List		*fxact_list = NIL;
+	ListCell	*cell;
+	bool		resolved = false;
+	int			i;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/*
+	 * Get the list of in-doubt transactions whose corresponding local
+	 * transaction is on the same database.
+	 */
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		FdwXact fxact = FdwXactCtl->fdw_xacts[i];
+
+		/*
+		 * Append it to the list if the fdwxact entry is on the same database,
+		 * is not locked by anyone, and has no local prepared transaction.
+		 */
+		if (fxact->dboid == dbid &&
+			fxact->locking_backend == InvalidBackendId &&
+			!TwoPhaseExists(fxact->local_xid))
+			fxact_list = lappend(fxact_list, fxact);
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	if (list_length(fxact_list) == 0)
+		return false;
+
+	foreach(cell, fxact_list)
+	{
+		FdwXact fdwxact = (FdwXact) lfirst(cell);
+
+		elog(DEBUG1, "DANGLING fdwxact xid %X server %X at %p next %p",
+			 fdwxact->local_xid, fdwxact->serverid,
+			 fdwxact, fdwxact->fx_next);
+
+		if (!FdwXactResolveForeignTransaction(fdwxact, get_prepared_foreign_xact_resolver(fdwxact)))
+		{
+			/* Emit error */
+		}
+		else
+			resolved = true;
+	}
+
+	list_free(fxact_list);
+
+	return resolved;
+}
+
+/*
+ * AtEOXact_FdwXacts
+ *
+ */
+void
+AtEOXact_FdwXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	/*
+	 * In the commit case, during the pre-commit phase we already committed
+	 * the foreign transactions on the servers that cannot execute two-phase
+	 * commit protocol and prepared the transactions on the servers that
+	 * can. The prepared transactions are resolved by the resolver process.
+	 * In the abort case, on the other hand, some foreign transactions may
+	 * have been prepared, or may be in the middle of being prepared. The
+	 * prepared ones are left to be rolled back through resolution, while
+	 * here we just abort the foreign transactions that are not prepared yet.
+	 */
+	if (!is_commit)
+	{
+		foreach (lcell, MyFdwConnections)
+		{
+			FdwConnection	*fdw_conn = lfirst(lcell);
+
+			/*
+			 * Since the prepared foreign transaction should have been
+			 * resolved we abort the remaining not-prepared foreign
+			 * transactions.
+			 */
+			if (!fdw_conn->fdw_xact)
+			{
+				bool ret;
+
+				ret = fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+												 fdw_conn->umid, is_commit);
+				if (!ret)
+					ereport(WARNING, (errmsg("could not abort transaction on server \"%s\"",
+											 fdw_conn->servername)));
+			}
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Other backend might lock the
+	 * entry we used to lock, but there is no reason for a foreign transaction
+	 * entry to be locked after the transaction which locked it has ended.
+	 */
+	UnlockMyFdwXacts();
+	MyLockedFdwXacts = NIL;
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFdwConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FdwXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ *
+ * Note that the transaction can abort after we have prepared the foreign
+ * transactions, so we cannot release MyLockedFdwXacts and MyFdwConnections
+ * here. They are released after the resolver process rolls the transactions
+ * back during abort, or in AtEOXact_FdwXacts().
+ */
+void
+AtPrepare_FdwXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFdwConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared
+	 * should be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	FdwXactPrepareForeignTransactions();
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FdwXact fdw_xact)
+{
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * FdwXactResolveForeignTransaction
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine. The foreign transaction can be a dangling transaction
+ * whose fate (commit or abort) has not been decided yet.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+FdwXactResolveForeignTransaction(FdwXact fdw_xact,
+			   ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool		resolved;
+	bool		is_commit;
+
+	if (!(fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+		  fdw_xact->status == FDW_XACT_ABORTING_PREPARED))
+		elog(DEBUG1, "fdwxact status : %d", fdw_xact->status);
+
+	/*
+	 * Determine whether we commit or abort this foreign transaction.
+	 */
+	if (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED)
+		is_commit = true;
+	else if (fdw_xact->status == FDW_XACT_ABORTING_PREPARED)
+		is_commit = false;
+	else if (TransactionIdDidCommit(fdw_xact->local_xid))
+	{
+		fdw_xact->status = FDW_XACT_COMMITTING_PREPARED;
+		is_commit = true;
+	}
+	else if (TransactionIdDidAbort(fdw_xact->local_xid))
+	{
+		fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+		is_commit = false;
+	}
+	else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+	{
+		/*
+		 * The local transaction is not in progress and the foreign
+		 * transaction is not prepared on the foreign server. This
+		 * can happen when we crashed after registering this entry but
+		 * before actually preparing on the foreign server. So we assume
+		 * it to be aborted.
+		 */
+		is_commit = false;
+	}
+	else
+	{
+		/*
+		 * The local transaction is in progress and the foreign transaction
+		 * state is neither committing nor aborting. This should not
+		 * happen: we cannot decide whether to commit or abort a foreign
+		 * transaction associated with an in-progress local transaction.
+		 */
+		ereport(ERROR,
+				(errmsg("cannot resolve foreign transaction associated with in-progress transaction %u on server %u",
+						fdw_xact->local_xid, fdw_xact->serverid)));
+	}
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * Get foreign transaction entry from FdwXactCtl->fdw_xacts. Return NULL
+ * if foreign transaction does not exist.
+ */
+static FdwXact
+get_fdw_xact(TransactionId xid, Oid serverid, Oid userid)
+{
+	int i;
+	FdwXact fdw_xact;
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		fdw_xact = FdwXactCtl->fdw_xacts[i];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			LWLockRelease(FdwXactLock);
+			return fdw_xact;
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+	return NULL;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact. Check
+ * that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with the given criteria. The criteria are defined by the arguments
+ * that have valid values for their respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid   | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid   | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid   | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid   | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid   | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid   | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid) the function returns
+ * true if any entry exists, since any entry would match the criteria.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked by another
+ * backend is ignored. If all the qualifying entries are locked, nothing will
+ * be returned in the list but the returned value will be true.
+ */
+static bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FdwXactLock, lock_mode);
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		FdwXact		fdw_xact = FdwXactCtl->fdw_xacts[cnt];
+		bool		entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list.
+				 * If the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+
+					fdw_xact->locking_backend = MyBackendId;
+
+					/*
+					 * The list and its members may be required at the end of
+					 * the transaction
+					 */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFdwXacts = lappend(MyLockedFdwXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		FdwXactRedoAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		/* Delete FdwXact entry and file if exists */
+		FdwXactRedoRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFdwXacts
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FdwXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFdwXacts(XLogRecPtr redo_horizon)
+{
+	int			cnt;
+	int			serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	/* Another quick exit, before we allocate memory */
+	if (FdwXactCtl->numFdwXacts <= 0)
+	{
+		LWLockRelease(FdwXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FdwXact that need to be copied to
+	 * disk, so we perform all I/O while holding FdwXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * finds that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FdwXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked
+	 * invalid, because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		FdwXact		fdw_xact = FdwXactCtl->fdw_xacts[cnt];
+
+		if ((fdw_xact->valid || fdw_xact->inredo) &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFdwXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFdwXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+			  (errmsg_plural("%d foreign transaction state file was written "
+							 "for long-running prepared transactions",
+							 "%d foreign transaction state files were written "
+							 "for long-running prepared transactions",
+							 serialized_fdw_xacts,
+							 serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFdwXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+		   errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not read foreign transaction state from xlog at %X/%X",
+			   (uint32) (lsn >> 32),
+			   (uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FdwXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not recreate foreign transaction state file \"%s\": %m",
+			   path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no GXACT in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FdwXact		fdw_xacts;
+	int			num_xacts;
+	int			cur_xact;
+}	WorkingStatus;
+
+Datum
+pg_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFdwXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FdwXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FdwXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFdwXactList(FdwXact * fdw_xacts)
+{
+	int			num_xacts;
+	int			cnt_xacts;
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	if (FdwXactCtl->numFdwXacts == 0)
+	{
+		LWLockRelease(FdwXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FdwXactCtl->numFdwXacts;
+	*fdw_xacts = (FdwXact) palloc(sizeof(FdwXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FdwXactCtl->fdw_xacts[cnt_xacts],
+			   sizeof(FdwXactData));
+
+	LWLockRelease(FdwXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case:
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_fdw_xact_remove(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId xid;
+	Oid			dbid;
+	Oid			serverid;
+	Oid			userid;
+	List	   *entries_to_remove;
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+		DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FdwXact		fdw_xact = linitial(entries_to_remove);
+
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Resolve foreign transactions on the connecting database manually. This
+ * function returns true if we resolve any foreign transaction, otherwise
+ * returns false.
+ */
+Datum
+pg_resolve_foreign_xacts(PG_FUNCTION_ARGS)
+{
+	bool    ret;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to resolve foreign transactions"))));
+
+	ret = FdwXactResolveForeignTransactions(MyDatabaseId);
+	PG_RETURN_BOOL(ret);
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFdwXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FdwXactOnDiskData *
+ReadFdwXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+	FdwXactOnDiskData *fdw_xact_file_data;
+	struct stat stat;
+	uint32		crc_offset;
+	pg_crc32c	calc_crc;
+	pg_crc32c	file_crc;
+	char	   *buf;
+
+	FdwXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			   errmsg("could not open FDW transaction state file \"%s\": %m",
+					  path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not stat FDW transaction state file \"%s\": %m",
+					  path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FdwXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+				 errmsg("FDW transaction state file \"%s\" has an invalid size",
+						path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FdwXactOnDiskData *) buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not read FDW transaction state file \"%s\": %m",
+					  path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+		fdw_xact_file_data->userid != userid ||
+		fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+			(errmsg("removing corrupt foreign transaction state file \"%s\"",
+					path)));
+		/* fd was already closed above; do not close it again */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFdwXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We can not know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFdwXacts(TransactionId oldestActiveXid)
+{
+	TransactionId nextXid = ShmemVariableCache->nextXid;
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding to an
+			 * XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+						(errmsg("removing future foreign prepared transaction file \"%s\"",
+								clde->d_name)));
+				RemoveFdwXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+
+/*
+ * RecoverFdwXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFdwXacts(void)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+			FdwXactOnDiskData *fdw_xact_file_data;
+			FdwXact		fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			fdw_xact_file_data = ReadFdwXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction file \"%s\"",
+						  clde->d_name)));
+				RemoveFdwXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+							local_xid, serverid, userid)));
+
+			fdw_xact = get_fdw_xact(local_xid, serverid, userid);
+
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			if (!fdw_xact)
+			{
+				/*
+				 * Add this entry into the table of foreign transactions. The
+				 * status of the transaction is set as preparing, since we do not
+				 * know the exact status right now. Resolver will set it later
+				 * based on the status of local transaction which prepared this
+				 * foreign transaction.
+				 */
+				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										   serverid, userid,
+										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id);
+				fdw_xact->locking_backend = MyBackendId;
+				fdw_xact->status = FDW_XACT_PREPARING;
+			}
+			else
+			{
+				Assert(fdw_xact->inredo);
+				fdw_xact->inredo = false;
+			}
+
+			/* Mark the entry as ready */
+			fdw_xact->valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			pfree(fdw_xact_file_data);
+			LWLockRelease(FdwXactLock);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFdwXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FdwXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove foreign transaction state file \"%s\": %m",
+							path)));
+}
+
+/*
+ * FdwXactRedoAdd
+ *
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdw_xact entry in shared memory FdwXactData structure.
+ */
+void
+FdwXactRedoAdd(XLogReaderState *record)
+{
+	FdwXactOnDiskData *fdw_xact_data = (FdwXactOnDiskData *) XLogRecGetData(record);
+	FdwXact fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
+							   fdw_xact_data->serverid, fdw_xact_data->userid,
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+	fdw_xact->inredo = true;
+	fdw_xact->valid = true;
+	LWLockRelease(FdwXactLock);
+}
+/*
+ * FdwXactRedoRemove
+ *
+ * Remove the corresponding fdw_xact entry from FdwXactCtl.
+ * Also remove fdw_xact file if a foreign transaction was saved
+ * via an earlier checkpoint.
+ */
+void
+FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	FdwXact	fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+	if (fdw_xact)
+	{
+		/* Now we can clean up any files we already left */
+		Assert(fdw_xact->inredo);
+		remove_fdw_xact(fdw_xact);
+	}
+	else
+	{
+		/*
+		 * Entry could be on disk. Call with giveWarning = false
+		 * since it can be expected during replay.
+		 */
+		RemoveFdwXactFile(xid, serverid, userid, false);
+	}
+}
diff --git a/src/backend/access/transam/fdwxact_resolver.c b/src/backend/access/transam/fdwxact_resolver.c
new file mode 100644
index 0000000..6d3d08c
--- /dev/null
+++ b/src/backend/access/transam/fdwxact_resolver.c
@@ -0,0 +1,532 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.c
+ *
+ * PostgreSQL foreign transaction resolver worker
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/fdwxact_resolver.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/resolver_private.h"
+
+#include "funcapi.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+/* GUC parameters */
+int foreign_xact_resolution_interval;
+int foreign_xact_resolver_timeout = 60 * 1000;
+
+FdwXactRslvCtlData *FdwXactRslvCtl;
+
+static long FdwXactRslvComputeSleepTime(TimestampTz now);
+static void FdwXactRslvProcessForeignTransactions(void);
+
+static void fdwxact_resolver_sighup(SIGNAL_ARGS);
+static void fdwxact_resolver_onexit(int code, Datum arg);
+
+/* Flags set by signal handlers */
+static volatile sig_atomic_t got_SIGHUP = false;
+
+/* Report shared memory space needed by FdwXactResolverShmemInit */
+Size
+FdwXactResolverShmemSize(void)
+{
+	Size		size = 0;
+
+	size = add_size(size, mul_size(max_foreign_xact_resolvers,
+								   sizeof(FdwXactResolver)));
+
+	return size;
+}
+
+/*
+ * Allocate and initialize foreign transaction resolver shared
+ * memory.
+ */
+void
+FdwXactResolverShmemInit(void)
+{
+	bool found;
+
+	FdwXactRslvCtl = ShmemInitStruct("Foreign transactions resolvers",
+									 FdwXactResolverShmemSize(),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		int	slot;
+
+		for (slot = 0; slot < max_foreign_xact_resolvers; slot++)
+		{
+			FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[slot];
+
+			/* Initialize */
+			MemSet(resolver, 0, sizeof(FdwXactResolver));
+		}
+
+		SHMQueueInit(&(FdwXactRslvCtl->FdwXactQueue));
+	}
+	else
+	{
+		Assert(FdwXactRslvCtl);
+		Assert(found);
+	}
+}
+
+/*
+ * Cleanup function for foreign transaction resolver
+ */
+static void
+fdwxact_resolver_onexit(int code, Datum arg)
+{
+	MyFdwXactResolver->pid = InvalidPid;
+	MyFdwXactResolver->in_use = false;
+}
+
+void
+fdwxact_resolver_attach(int slot)
+{
+	LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+	MyFdwXactResolver = &FdwXactRslvCtl->resolvers[slot];
+	MyFdwXactResolver->pid = MyProcPid;
+	MyFdwXactResolver->latch = &MyProc->procLatch;
+
+	if (!MyFdwXactResolver->in_use)
+	{
+		LWLockRelease(FdwXactResolverLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("foreign transaction resolver slot %d is empty, cannot attach",
+						slot)));
+	}
+
+	before_shmem_exit(fdwxact_resolver_onexit, (Datum) 0);
+
+	LWLockRelease(FdwXactResolverLock);
+}
+
+/* Set flag to reload configuration at next convenient time */
+static void
+fdwxact_resolver_sighup(SIGNAL_ARGS)
+{
+	int		save_errno = errno;
+
+	got_SIGHUP = true;
+
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/* Foreign transaction resolver entry point */
+void
+FdwXactRslvMain(Datum main_arg)
+{
+	int slot = DatumGetInt32(main_arg);
+
+	fdwxact_resolver_attach(slot);
+
+	elog(DEBUG1, "foreign transaction resolver for database %u started",
+		 MyFdwXactResolver->dbid);
+
+	/* Establish signal handlers */
+	pqsignal(SIGHUP, fdwxact_resolver_sighup);
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
+
+	/* Establish connection to nailed catalogs */
+	BackgroundWorkerInitializeConnectionByOid(MyFdwXactResolver->dbid, InvalidOid);
+
+	for (;;)
+	{
+		int			rc;
+		TimestampTz	now;
+		long		sleep_time;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/* Resolve pending transactions, if any */
+		FdwXactRslvProcessForeignTransactions();
+
+		now = GetCurrentTimestamp();
+
+		sleep_time = FdwXactRslvComputeSleepTime(now);
+
+		/*
+		 * We reached the timeout here. We can exit only if there is no
+		 * remaining task registered by backend processes. Check that, and
+		 * shut down if so, while holding FdwXactResolverLock.
+		 */
+		if (sleep_time < 0)
+		{
+			LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+			Assert(MyFdwXactResolver->num_entries >= 0);
+
+			if (MyFdwXactResolver->num_entries == 0)
+			{
+				/*
+				 * There are no more transactions we need to resolve;
+				 * turn off my slot while holding the lock so that concurrent
+				 * backends cannot register additional entries.
+				 */
+				MyFdwXactResolver->in_use = false;
+
+				LWLockRelease(FdwXactResolverLock);
+
+				ereport(LOG,
+						(errmsg("foreign transaction resolver for database \"%u\" will stop because of the timeout",
+								MyFdwXactResolver->dbid)));
+
+				proc_exit(0);
+			}
+
+			LWLockRelease(FdwXactResolverLock);
+
+			/*
+			 * Backend processes registered new tasks while we were
+			 * checking. Since we know there are transactions to resolve,
+			 * we don't want to sleep.
+			 */
+			sleep_time = 0;
+		}
+
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   sleep_time,
+					   WAIT_EVENT_FDW_XACT_RESOLVER_MAIN);
+
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+	}
+}
+
+/*
+ * Process all foreign transactions on the database this resolver is connected
+ * to. If resolution succeeds, update the last resolution time. Return once a
+ * cycle resolves no foreign transaction.
+ */
+static void
+FdwXactRslvProcessForeignTransactions(void)
+{
+	int	n_fx;
+
+	/* Quick exit if there is no registered entry */
+	LWLockAcquire(FdwXactResolverLock, LW_SHARED);
+	n_fx = MyFdwXactResolver->num_entries;
+	LWLockRelease(FdwXactResolverLock);
+
+	if (n_fx == 0)
+		return;
+
+	//elog(WARNING, "pid %d sleep", MyProcPid);
+	//pg_usleep(30000000L);
+	//pg_usleep(30000000L);
+
+	/*
+	 * Loop until there are no more foreign transactions we need to resolve.
+	 */
+	for (;;)
+	{
+		bool	resolved_mydb;
+		bool	resolved_dangling;
+
+		StartTransactionCommand();
+
+		/* Resolve all foreign transactions associated with xid */
+		resolved_mydb = FdwXactResolveForeignTransactions(MyFdwXactResolver->dbid);
+
+		/* Resolve dangling transactions, if any */
+		resolved_dangling = FdwXactResolveDanglingTransactions(MyFdwXactResolver->dbid);
+
+		CommitTransactionCommand();
+
+		/* If we processed all entries so far */
+		if (resolved_mydb)
+		{
+			/* XXX : we should use spinlock or atomic operation */
+			LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+			MyFdwXactResolver->num_entries--;
+			MyFdwXactResolver->last_resolution_time = GetCurrentTimestamp();
+			LWLockRelease(FdwXactResolverLock);
+		}
+
+		if (!resolved_mydb && !resolved_dangling)
+			break;
+	}
+}
+
+/*
+ * Compute how long we should sleep until the next cycle. Return the sleep time
+ * in milliseconds; -1 means that we reached the timeout and should exit.
+ */
+static long
+FdwXactRslvComputeSleepTime(TimestampTz now)
+{
+	static TimestampTz	wakeuptime = 0;
+	long	sleeptime;
+	long	sec_to_timeout;
+	int		microsec_to_timeout;
+
+	if (foreign_xact_resolver_timeout > 0)
+	{
+		TimestampTz timeout;
+
+		timeout = TimestampTzPlusMilliseconds(MyFdwXactResolver->last_resolution_time,
+										  foreign_xact_resolver_timeout);
+
+		/* If we reached the timeout, exit */
+		if (now >= timeout)
+			return -1;
+	}
+
+	if (now >= wakeuptime)
+		wakeuptime = TimestampTzPlusMilliseconds(now,
+												 foreign_xact_resolution_interval * 1000);
+
+	/* Compute relative time until wakeup. */
+	TimestampDifference(now, wakeuptime,
+						&sec_to_timeout, &microsec_to_timeout);
+
+	sleeptime = sec_to_timeout * 1000 + microsec_to_timeout / 1000;
+
+	return sleeptime;
+}
+
+/*
+ * Launch a new foreign transaction resolver worker if not launched yet.
+ * A foreign transaction resolver worker is responsible for the resolution
+ * of the foreign transactions registered on one database. So if a resolver
+ * worker has already been launched by another backend, we don't need to
+ * launch a new one.
+ */
+void
+fdwxact_maybe_launch_resolver(void)
+{
+	FdwXactResolver *resolver = NULL;
+	BackgroundWorker bgw;
+	BackgroundWorkerHandle *bgw_handle;
+	int i;
+	int	slot;
+	bool	found = false;
+
+	LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver *r = &FdwXactRslvCtl->resolvers[i];
+
+		/*
+		 * Found a running resolver that is responsible for the
+		 * database "dbid".
+		 */
+		if (r->in_use && r->pid != InvalidPid && r->dbid == MyDatabaseId)
+		{
+			Assert(!found);
+			found = true;
+			resolver = r;
+		}
+	}
+
+	/*
+	 * If we found the resolver for my database, we don't need to launch a new
+	 * one. Add a task and wake it up.
+	 */
+	if (found)
+	{
+		resolver->num_entries++;
+		SetLatch(resolver->latch);
+		LWLockRelease(FdwXactResolverLock);
+		elog(DEBUG1, "found a running foreign transaction resolver process for database %u",
+			 MyDatabaseId);
+		return;
+	}
+
+	elog(DEBUG1, "starting foreign transaction resolver for database ID %u", MyDatabaseId);
+
+	/* Find unused worker slot */
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver *r = &FdwXactRslvCtl->resolvers[i];
+
+		/* Found an unused worker slot */
+		if (!r->in_use)
+		{
+			resolver = r;
+			slot = i;
+			break;
+		}
+	}
+
+	/*
+	 * However if there are no more free worker slots, inform user about it before
+	 * exiting.
+	 */
+	if (resolver == NULL)
+	{
+		LWLockRelease(FdwXactResolverLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of foreign transaction resolver slots"),
+				 errhint("You might need to increase max_foreign_transaction_resolvers.")));
+
+		return;
+	}
+
+	/* Prepare the resolver slot. It's in use but pid is still invalid */
+	resolver->dbid = MyDatabaseId;
+	resolver->in_use = true;
+	resolver->num_entries = 1;
+	resolver->pid = InvalidPid;
+	TIMESTAMP_NOBEGIN(resolver->last_resolution_time);
+
+	LWLockRelease(FdwXactResolverLock);
+
+	/* Register the new dynamic worker */
+	memset(&bgw, 0, sizeof(bgw));
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactRslvMain");
+	snprintf(bgw.bgw_name, BGW_MAXLEN,
+			 "foreign transaction resolver for database %u", MyDatabaseId);
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "foreign transaction resolver");
+	bgw.bgw_restart_time = BGW_NEVER_RESTART;
+	bgw.bgw_main_arg = Int32GetDatum(slot);
+	bgw.bgw_notify_pid = MyProcPid;
+
+	if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+	{
+		/* Failed to launch, cleanup the worker slot */
+		LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+		resolver->in_use = false;
+		LWLockRelease(FdwXactResolverLock);
+
+		ereport(WARNING,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of background worker slots"),
+				 errhint("You might need to increase max_worker_processes.")));
+	}
+
+	/*
+	 * We don't need to wait until it attaches here because we're going to wait
+	 * until all foreign transactions are resolved.
+	 */
+}
+
+/*
+ * Returns activity of foreign transaction resolvers, including pids, the number
+ * of tasks and the last resolution time.
+ */
+Datum
+pg_stat_get_fdwxact_resolvers(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_FDWXACT_RESOLVERS_COLS 4
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	int i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	LWLockAcquire(FdwXactResolverLock, LW_SHARED);
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver	*resolver = &FdwXactRslvCtl->resolvers[i];
+		pid_t	pid;
+		Oid		dbid;
+		int		num_entries;
+		TimestampTz last_resolution_time;
+		Datum		values[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+		bool		nulls[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+
+		if (!resolver->in_use || resolver->pid == InvalidPid)
+			continue;
+
+		pid = resolver->pid;
+		dbid = resolver->dbid;
+		num_entries = resolver->num_entries;
+		last_resolution_time = resolver->last_resolution_time;
+
+		memset(nulls, 0, sizeof(nulls));
+		/* pid */
+		values[0] = Int32GetDatum(pid);
+
+		/* dbid */
+		values[1] = ObjectIdGetDatum(dbid);
+
+		/* n_entries */
+		values[2] = Int32GetDatum(num_entries);
+
+		/* last_resolution_time */
+		if (last_resolution_time == 0 || TIMESTAMP_IS_NOBEGIN(last_resolution_time))
+			nulls[3] = true;
+		else
+			values[3] = TimestampTzGetDatum(last_resolution_time);
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	LWLockRelease(FdwXactResolverLock);
+
+	/* clean up and return the tuplestore */
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b360b1 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index cfaf8da..7104aea 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -845,6 +846,35 @@ TwoPhaseGetGXact(TransactionId xid)
 }
 
 /*
+ * TwoPhaseExists
+ *		Return true if there is a prepared transaction specified by XID
+ */
+bool
+TwoPhaseExists(TransactionId xid)
+{
+	int		i;
+	bool	found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGXACT	*pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+		if (pgxact->xid == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return found;
+}
+
+/*
  * TwoPhaseGetDummyProc
  *		Get the dummy backend ID for prepared transaction specified by XID
  *
@@ -2240,6 +2270,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	 * in the procarray and continue to hold locks.
 	 */
 	SyncRepWaitForLSN(recptr, true);
+
+	/*
+	 * Wait for foreign transaction prepared as part of this prepared
+	 * transaction to be committed.
+	 */
+	FdwXactWaitForResolve(xid, true);
 }
 
 /*
@@ -2298,6 +2334,12 @@ RecordTransactionAbortPrepared(TransactionId xid,
 	 * in the procarray and continue to hold locks.
 	 */
 	SyncRepWaitForLSN(recptr, false);
+
+	/*
+	 * Wait for foreign transaction prepared as part of this prepared
+	 * transaction to be rolled back.
+	 */
+	FdwXactWaitForResolve(xid, false);
 }
 
 /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 52e1286..f2b3e31 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1131,6 +1132,7 @@ RecordTransactionCommit(void)
 	SharedInvalidationMessage *invalMessages = NULL;
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
+	bool		need_twophase;
 
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
@@ -1139,6 +1141,7 @@ RecordTransactionCommit(void)
 		nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
 													 &RelcacheInitFileInval);
 	wrote_xlog = (XactLastRecEnd != 0);
+	need_twophase = TwoPhaseCommitRequired();
 
 	/*
 	 * If we haven't been assigned an XID yet, we neither can, nor do we want
@@ -1177,12 +1180,13 @@ RecordTransactionCommit(void)
 		}
 
 		/*
-		 * If we didn't create XLOG entries, we're done here; otherwise we
-		 * should trigger flushing those entries the same as a commit record
+		 * If we didn't create XLOG entries and the transaction does not need
+		 * to be committed using two-phase commit, we're done here; otherwise
+		 * we should trigger flushing those entries the same as a commit record
 		 * would.  This will primarily happen for HOT pruning and the like; we
 		 * want these to be flushed to disk in due time.
 		 */
-		if (!wrote_xlog)
+		if (!wrote_xlog && !need_twophase)
 			goto cleanup;
 	}
 	else
@@ -1340,6 +1344,14 @@ RecordTransactionCommit(void)
 	if (wrote_xlog && markXidCommitted)
 		SyncRepWaitForLSN(XactLastRecEnd, true);
 
+	/*
+	 * Wait for prepared foreign transaction to be resolved, if required.
+	 * We only want to wait if we prepared foreign transaction in this
+	 * transaction.
+	 */
+	if (need_twophase && markXidCommitted)
+		FdwXactWaitForResolve(xid, true);
+
 	/* remember end of last commit record */
 	XactLastCommitEnd = XactLastRecEnd;
 
@@ -1619,6 +1631,14 @@ RecordTransactionAbort(bool isSubXact)
 	if (isSubXact)
 		XidCacheRemoveRunningXids(xid, nchildren, children, latestXid);
 
+	/*
+	 * If we're aborting a main transaction, we wait for prepared foreign
+	 * transaction to be resolved, if required. We only want to wait if
+	 * we're aborting while preparing foreign transactions.
+	 */
+	if (!isSubXact)
+		FdwXactWaitForResolve(xid, false);
+
 	/* Reset XactLastRecEnd until the next transaction writes something */
 	if (!isSubXact)
 		XactLastRecEnd = 0;
@@ -1977,6 +1997,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FdwXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2132,6 +2155,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_FdwXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2221,6 +2245,8 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	AtPrepare_FdwXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2409,6 +2435,7 @@ PrepareTransaction(void)
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
+	AtEOXact_FdwXacts(true);
 	/* don't call AtEOXact_PgStat here; we fixed pgstat state above */
 	AtEOXact_Snapshot(true, true);
 	pgstat_report_xact_timestamp(0);
@@ -2615,6 +2642,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_FdwXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a1..64f4ede 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5174,6 +5175,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6261,6 +6263,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
@@ -6962,8 +6967,12 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
+
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7588,6 +7597,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7775,6 +7785,8 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	RecoverFdwXacts();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -9101,6 +9113,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	CheckPointFdwXacts(checkPointRedo);
 }
 
 /*
@@ -9538,7 +9551,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9570,6 +9584,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9767,6 +9782,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9956,6 +9972,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index dc40cde..75d6437 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_prepared_foreign_xacts AS
+       SELECT * FROM pg_prepared_foreign_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
@@ -769,6 +772,15 @@ CREATE VIEW pg_stat_subscription AS
             LEFT JOIN pg_stat_get_subscription(NULL) st
                       ON (st.subid = su.oid);
 
+CREATE VIEW pg_stat_fdwxact_resolvers AS
+    SELECT
+            r.pid,
+            r.dbid,
+            r.n_entries,
+            r.last_resolution_time
+    FROM pg_stat_get_fdwxact_resolvers() r
+    WHERE r.pid IS NOT NULL;
+
 CREATE VIEW pg_stat_ssl AS
     SELECT
             S.pid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 9ad9915..b0e1c8d 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1093,6 +1094,14 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1403,6 +1412,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+						srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 4a3c4b4..1869788 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,6 +16,7 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "access/fdwxact_resolver.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -129,6 +130,9 @@ static const struct
 	},
 	{
 		"ApplyWorkerMain", ApplyWorkerMain
+	},
+	{
+		"FdwXactRslvMain", FdwXactRslvMain
 	}
 };
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a0b49c..46fb4ad 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,12 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_FDW_XACT_RESOLUTION:
+			event_name = "FdwXactResolution";
+			break;
+		case WAIT_EVENT_FDW_XACT_RESOLVER_MAIN:
+			event_name = "FdwXactResolver";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 2b2b993..efb6f75 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdwxact_resolver.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/ip.h"
@@ -897,6 +898,10 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
 
+	if (max_prepared_foreign_xacts > 0 && max_foreign_xact_resolvers == 0)
+		ereport(ERROR,
+				(errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires max_foreign_transaction_resolvers > 0")));
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c..27716b5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -150,6 +150,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_FDW_XACT_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..ac2d7b8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,8 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +152,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FdwXactShmemSize());
+		size = add_size(size, FdwXactResolverShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +274,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FdwXactShmemInit();
+	FdwXactResolverShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..a42d06e 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,5 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+FdwXactLock					46
+FdwXactResolverLock			47
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..3d09978 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -35,6 +35,7 @@
 #include <unistd.h>
 #include <sys/time.h>
 
+#include "access/fdwxact.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
@@ -397,6 +398,10 @@ InitProcess(void)
 	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
 	SHMQueueElemInit(&(MyProc->syncRepLinks));
 
+	/* initialize fields for fdw xact */
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+	SHMQueueElemInit(&(MyProc->fdwXactLinks));
+
 	/* Initialize fields for group XID clearing. */
 	MyProc->procArrayGroupMember = false;
 	MyProc->procArrayGroupMemberXid = InvalidTransactionId;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ae22185..a8f208e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2064,6 +2065,51 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"foreign_transaction_resolver_timeout", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Sets the maximum time to wait for foreign transaction resolution."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&foreign_xact_resolver_timeout,
+		60 * 1000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"max_foreign_transaction_resolvers", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Maximum number of foreign transaction resolution processes."),
+			NULL
+		},
+		&max_foreign_xact_resolvers,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"foreign_transaction_resolution_interval", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+		 gettext_noop("Sets the maximum interval between resolving foreign transactions."),
+		 NULL,
+		 GUC_UNIT_S
+		},
+		&foreign_xact_resolution_interval,
+		10, 0, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 368b280..a63d36f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -117,6 +117,8 @@
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
 #work_mem = 4MB				# min 64kB
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 27fcf5a..9404506 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,6 +208,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 8cc4fb0..d8a7065 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -287,6 +287,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 25d5547..168cce8 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -672,6 +672,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -894,6 +895,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..2e88496 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -11,6 +11,7 @@
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/generic_xlog.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
diff --git a/src/include/access/fdwxact.h b/src/include/access/fdwxact.h
new file mode 100644
index 0000000..79256d3
--- /dev/null
+++ b/src/include/access/fdwxact.h
@@ -0,0 +1,154 @@
+/*
+ * fdwxact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "access/xlogreader.h"
+#include "access/resolver_private.h"
+#include "foreign/foreign.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "storage/backendid.h"
+#include "storage/shmem.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+#define	FDW_XACT_NOT_WAITING		0
+#define	FDW_XACT_WAITING			1
+#define	FDW_XACT_WAIT_COMPLETE		2
+
+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
+#define FdwXactEnabled() (max_prepared_foreign_xacts > 0)
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FdwXactData *FdwXact;
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,		/* foreign prepared transaction is to
+										 * be committed */
+	FDW_XACT_ABORTING_PREPARED, /* foreign prepared transaction is to be
+								 * aborted */
+	FDW_XACT_RESOLVED
+} FdwXactStatus;
+
+typedef struct FdwXactData
+{
+	FdwXact		fx_free_next;	/* Next free FdwXact entry */
+	FdwXact		fx_next;		/* Next FdwXact entry associated with the same
+								   transaction */
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;	/* XID of local transaction */
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping id for connection key */
+	FdwXactStatus status;		/* The state of the foreign
+								 * transaction. This doubles as the
+								 * action to be taken on this entry. */
+
+	/*
+	 * Note that we need to keep track of two LSNs for each FdwXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FdwXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior to
+	 * commit.
+	 */
+	XLogRecPtr	fdw_xact_start_lsn;		/* XLOG offset of inserting this entry start */
+	XLogRecPtr	fdw_xact_end_lsn;		/* XLOG offset of inserting this entry end */
+
+	bool		valid; /* Has the entry been complete and written to file? */
+	BackendId	locking_backend;	/* Backend working on this entry */
+	bool		ondisk;			/* TRUE if prepare state file is on disk */
+	bool		inredo;			/* TRUE if entry was added via xlog_redo */
+	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction identifier */
+}	FdwXactData;
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FdwXactData structs */
+	FdwXact		freeFdwXacts;
+
+	/* Number of valid foreign transaction entries */
+	int			numFdwXacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FdwXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];		/* Variable length array */
+}	FdwXactCtlData;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+ FdwXactCtlData *FdwXactCtl;
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;
+	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+}	FdwXactOnDiskData;
+
+typedef struct
+{
+	TransactionId xid;
+	Oid			serverid;
+	Oid			userid;
+	Oid			dbid;
+}	FdwRemoveXlogRec;
+
+/* GUC parameters */
+extern int	max_prepared_foreign_xacts;
+extern int	max_foreign_xact_resolvers;
+extern int	foreign_xact_resolution_interval;
+extern int	foreign_xact_resolver_timeout;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FdwXactShmemSize(void);
+extern void FdwXactShmemInit(void);
+extern void RecoverFdwXacts(void);
+extern void FdwXactRegisterForeignServer(Oid serverid, Oid userid, bool can_prepare,
+										 bool modify);
+extern TransactionId PrescanFdwXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void AtEOXact_FdwXacts(bool is_commit);
+extern void AtPrepare_FdwXacts(void);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+				Oid userid);
+extern void CheckPointFdwXacts(XLogRecPtr redo_horizon);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FdwXacts(void);
+extern void FdwXactRedoAdd(XLogReaderState *record);
+extern void FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFdwXactRecreateFiles(XLogRecPtr redo_horizon);
+extern void FdwXactWaitForResolve(TransactionId wait_xid, bool commit);
+extern bool FdwXactResolveForeignTransactions(Oid dbid);
+extern bool FdwXactResolveDanglingTransactions(Oid dbid);
+extern bool TwoPhaseCommitRequired(void);
+
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+
+#endif   /* FDW_XACT_H */
diff --git a/src/include/access/fdwxact_resolver.h b/src/include/access/fdwxact_resolver.h
new file mode 100644
index 0000000..7ce1551
--- /dev/null
+++ b/src/include/access/fdwxact_resolver.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.h
+ *	  PostgreSQL foreign transaction resolver definitions
+ *
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_resolver.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FDWXACT_RESOLVER_H
+#define FDWXACT_RESOLVER_H
+
+#include "access/fdwxact.h"
+
+extern void FdwXactRslvMain(Datum main_arg);
+extern Size FdwXactResolverShmemSize(void);
+extern void FdwXactResolverShmemInit(void);
+
+extern void fdwxact_resolver_attach(int slot);
+extern void fdwxact_maybe_launch_resolver(void);
+
+extern int foreign_xact_resolver_timeout;
+
+#endif		/* FDWXACT_RESOLVER_H */
diff --git a/src/include/access/resolver_private.h b/src/include/access/resolver_private.h
new file mode 100644
index 0000000..c3c26ba
--- /dev/null
+++ b/src/include/access/resolver_private.h
@@ -0,0 +1,60 @@
+/*-------------------------------------------------------------------------
+ *
+ * resolver_private.h
+ *	  Private definitions from access/transam/fdwxact/resolver.c
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/resolver_private.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef _RESOLVER_PRIVATE_H
+#define _RESOLVER_PRIVATE_H
+
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "utils/timestamp.h"
+
+/*
+ * Each foreign transaction resolver has a FdwXactResolver struct in
+ * shared memory.  This struct is protected by FdwXactResolverLaunchLock.
+ */
+typedef struct FdwXactResolver
+{
+	pid_t	pid;	/* this resolver's PID, or 0 if not active */
+	Oid		dbid;	/* database oid */
+
+	/* Indicates if this slot is used or free */
+	bool	in_use;
+
+	/* The number of tasks this resolver has */
+	int		num_entries;
+
+	/* Stats */
+	TimestampTz	last_resolution_time;
+
+	/*
+	 * Pointer to the resolver's latch. Used by backends to wake up this
+	 * resolver when it has work to do. NULL if the resolver isn't active.
+	 */
+	Latch	*latch;
+} FdwXactResolver;
+
+/* There is one FdwXactRslvCtlData struct for the whole database cluster */
+typedef struct FdwXactRslvCtlData
+{
+	/*
+	 * Foreign transaction resolution queue. Protected by FdwXactLock.
+	 */
+	SHM_QUEUE	FdwXactQueue;
+
+	FdwXactResolver resolvers[FLEXIBLE_ARRAY_MEMBER];
+} FdwXactRslvCtlData;
+
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+extern FdwXactResolver *MyFdwXactResolver;
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+
+#endif	/* _RESOLVER_PRIVATE_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 2f43c19..62702de 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 54dec4e..3af7858 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -35,6 +35,7 @@ extern void PostPrepare_Twophase(void);
 
 extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid);
 extern BackendId TwoPhaseGetDummyBackendId(TransactionId xid);
+extern bool	TwoPhaseExists(TransactionId xid);
 
 extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 				TimestampTz prepared_at,
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 22a8e63..54114ae 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -227,6 +227,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3fed3b6..3189eda 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -179,6 +179,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 93c031a..4b4cdda 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f
 DESCR("statistics: information about WAL receiver");
 DATA(insert OID = 6118 (  pg_stat_get_subscription	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}" "{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}" _null_ _null_ pg_stat_get_subscription _null_ _null_ _null_ ));
 DESCR("statistics: information about subscription");
+DATA(insert OID = 4101 (  pg_stat_get_fdwxact_resolvers	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{26,26,26,1184}" "{o,o,o,o}" "{pid,dbid,n_entries,last_resolution_time}" _null_ _null_ pg_stat_get_fdwxact_resolvers _null_ _null_ _null_ ));
+DESCR("statistics: information about foreign transaction resolvers");
 DATA(insert OID = 2026 (  pg_backend_pid				PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
 DESCR("statistics: current backend PID");
 DATA(insert OID = 1937 (  pg_stat_get_backend_pid		PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_ ));
@@ -4204,6 +4206,15 @@ DESCR("get the available time zone names");
 DATA(insert OID = 2730 (  pg_get_triggerdef		PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_get_triggerdef_ext _null_ _null_ _null_ ));
 DESCR("trigger description with pretty-print option");
 
+/* foreign transactions */
+DATA(insert OID = 4099 ( pg_prepared_foreign_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4100 ( pg_fdw_xact_remove PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_fdw_xact_remove _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
+DATA(insert OID = 4126 ( pg_resolve_foreign_xacts	PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 16 "" _null_ _null_ _null_ _null_ _null_ pg_resolve_foreign_xacts _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+
+
 /* asynchronous notifications */
 DATA(insert OID = 3035 (  pg_listening_channels PGNSP PGUID 12 1 10 0 0 f f f f t t s r 0 0 25 "" _null_ _null_ _null_ _null_ _null_ pg_listening_channels _null_ _null_ _null_ ));
 DESCR("get the channels that the current backend listens to");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 04e43cc..bbadc2ec 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -162,6 +162,18 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
 
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+										int *prep_info_len);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, const char *prep_id);
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															const char *prep_id);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -226,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for distributed transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..5e7ae7d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,9 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_FDW_XACT_RESOLUTION,
+	WAIT_EVENT_FDW_XACT_RESOLVER_MAIN
 } WaitEventIPC;
 
 /* ----------
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 205f484..24fc4dc 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -150,6 +150,16 @@ struct PGPROC
 	SHM_QUEUE	syncRepLinks;	/* list link if process is in syncrep queue */
 
 	/*
+	 * Info to allow us to wait for foreign transaction to be resolved, if
+	 * needed.
+	 */
+	TransactionId	waitXid;	/* waiting for foreign transaction involved with
+								 * this transaction id to be resolved */
+	int			fdwXactState;	/* wait state for foreign transaction
+								 * resolution */
+	SHM_QUEUE	fdwXactLinks;	/* list link if process is in queue */
+
+	/*
 	 * All PROCLOCK objects for locks held or awaited by this backend are
 	 * linked into one of these lists, according to the partition number of
 	 * their lock.
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index 142a1b8..1b28f3c 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -21,4 +21,4 @@ check:
 clean distclean maintainer-clean:
 	rm -rf tmp_check
 
-EXTRA_INSTALL = contrib/test_decoding
+EXTRA_INSTALL = contrib/test_decoding contrib/postgres_fdw
diff --git a/src/test/recovery/t/014_fdwxact.pl b/src/test/recovery/t/014_fdwxact.pl
new file mode 100644
index 0000000..8fa8241
--- /dev/null
+++ b/src/test/recovery/t/014_fdwxact.pl
@@ -0,0 +1,174 @@
+# Tests for transaction involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
+max_foreign_transaction_resolvers = 2
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+							   has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs2->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign servers on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE EXTENSION postgres_fdw
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+));
+
+# Create user mapping on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+));
+
+# Create tables on foreign nodes and import them to the master node
+$node_fs1->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 (c int);
+));
+$node_fs2->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 (c int);
+));
+$node_master->safe_psql('postgres', qq(
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE l_table (c int);
+));
+
+# Switch to synchronous replication
+$node_master->safe_psql('postgres', qq(
+ALTER SYSTEM SET synchronous_standby_names ='*';
+));
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node. Check if we can commit and rollback the foreign transactions
+# after the normal recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->stop;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after recovery');
+
+#
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node immediately. Check if we can commit and rollback the foreign
+# transactions after the crash recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the crash recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after crash recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after crash recovery');
+
+#
+# Commit transaction involving foreign servers and shutdown the master node
+# immediately before checkpoint. Check that WAL replay cleans up
+# its shared memory state and releases locks while replaying transaction commit.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (3);
+INSERT INTO t2 VALUES (3);
+COMMIT;
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', qq(
+SELECT count(*) FROM pg_prepared_foreign_xacts;
+));
+is($result, 0, "Cleanup of shared memory state for foreign transactions");
+
+#
+# Check if the standby node can process prepared foreign transaction
+# after promotion.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (4);
+INSERT INTO t2 VALUES (4);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (5);
+INSERT INTO t2 VALUES (5);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', qq(COMMIT PREPARED 'gxid1';));
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', qq(ROLLBACK PREPARED 'gxid2';));
+is($result, 0, 'Rollback foreign transaction after promotion');
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f1c1b44..11b0b91 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1413,6 +1413,13 @@ pg_policies| SELECT n.nspname AS schemaname,
    FROM ((pg_policy pol
      JOIN pg_class c ON ((c.oid = pol.polrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));
+pg_prepared_foreign_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_prepared_foreign_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_prepared_statements| SELECT p.name,
     p.statement,
     p.prepare_time,
@@ -1819,6 +1826,12 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
     pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
     pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
    FROM pg_database d;
+pg_stat_fdwxact_resolvers| SELECT r.pid,
+    r.dbid,
+    r.n_entries,
+    r.last_resolution_time
+   FROM pg_stat_get_fdwxact_resolvers() r(pid, dbid, n_entries, last_resolution_time)
+  WHERE (r.pid IS NOT NULL);
 pg_stat_progress_vacuum| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 0156b00..fbf3057 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2292,9 +2292,12 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions and
+		 * max_foreign_transaction_resolvers to enable testing of transaction
+		 * involving multiple foreign servers. (Note: to reduce the probability
+		 * of unexpected shmmax failures, don't set max_prepared_transactions
+		 * any higher than actually needed by the prepared_xacts regression
+		 * test.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2309,7 +2312,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
+		fputs("max_foreign_transaction_resolvers = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{
-- 
1.8.3.1

Attachment: 0001-Keep-track-of-local-writes_v13.patch (text/x-patch; charset=US-ASCII)

From b4482ca85151298f086a52e3977be8f949741bac Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sat, 21 Oct 2017 15:50:22 +0900
Subject: [PATCH 1/3] Keep track of local writes.

---
 src/backend/access/heap/heapam.c  | 10 ++++++++++
 src/backend/access/transam/xact.c | 27 +++++++++++++++++++++++++++
 src/include/access/xact.h         |  3 +++
 3 files changed, 40 insertions(+)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 52dda41..fafed0c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2496,6 +2496,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
 
+	/* Remember to write on local node for foreign transaction */
+	RegisterTransactionLocalNode();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -3305,6 +3308,9 @@ l1:
 	 */
 	MultiXactIdSetOldestMember();
 
+	/* Remember to write on local node for foreign transaction */
+	RegisterTransactionLocalNode();
+
 	compute_new_xmax_infomask(HeapTupleHeaderGetRawXmax(tp.t_data),
 							  tp.t_data->t_infomask, tp.t_data->t_infomask2,
 							  xid, LockTupleExclusive, true,
@@ -4210,6 +4216,10 @@ l2:
 	 */
 	CheckForSerializableConflictIn(relation, &oldtup, buffer);
 
+
+	/* Remember to write on local node for foreign transaction */
+	RegisterTransactionLocalNode();
+
 	/*
 	 * At this point newbuf and buffer are both pinned and locked, and newbuf
 	 * has enough space for the new tuple.  If they are the same buffer, only
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 8203388..52e1286 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -117,6 +117,9 @@ TransactionId *ParallelCurrentXids;
  */
 int			MyXactFlags;
 
+/* Transaction do the write on local node */
+bool		XactWriteLocalNode = false;
+
 /*
  *	transaction states - transaction state from server perspective
  */
@@ -2150,6 +2153,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	ForgetTransactionLocalNode();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2427,6 +2432,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	ForgetTransactionLocalNode();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2611,6 +2618,8 @@ AbortTransaction(void)
 		pgstat_report_xact_timestamp(0);
 	}
 
+	ForgetTransactionLocalNode();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4441,6 +4450,24 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RegisterTransactionLocalNode --- remember to write on local node
+ */
+void
+RegisterTransactionLocalNode(void)
+{
+	XactWriteLocalNode = true;
+}
+
+/*
+ * ForgetTransactionLocalNode --- forget to write on local node
+ */
+void
+ForgetTransactionLocalNode(void)
+{
+	XactWriteLocalNode = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index f2c10f9..3e7b054 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -91,6 +91,7 @@ extern int	MyXactFlags;
  */
 #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK	(1U << 1)
 
+extern bool XactWriteLocalNode;
 
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
@@ -377,6 +378,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RegisterTransactionLocalNode(void);
+extern void ForgetTransactionLocalNode(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
-- 
1.8.3.1

#157

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 8 years ago

In reply to: Masahiko Sawada (#156)

Re: Transactions involving multiple postgres foreign servers

On Wed, Oct 25, 2017 at 3:15 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Foreign Transaction Resolver
======================
I introduced a new background worker called the "foreign transaction
resolver", which is responsible for resolving transactions prepared
on foreign servers. The foreign transaction resolver process is
launched by a backend process when it commits or rolls back a
transaction. It periodically resolves the queued transactions on a
database as long as the queue is not empty. If the queue has been empty
for the time specified by the foreign_transaction_resolver_time GUC
parameter, it exits. This means that a backend doesn't launch a new
resolver process if one is already working. In that case, the backend
process just adds the entry to the queue in shared memory and wakes the
resolver up. The maximum number of resolver processes we can launch is
controlled by max_foreign_transaction_resolvers, so we recommend setting
max_foreign_transaction_resolvers to a value larger than the number of
databases. The resolver process also tries to resolve dangling
transactions in each cycle.
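
To make the resolver lifecycle in the quoted paragraph concrete, here is a minimal single-process sketch in C. It is not code from the patch: ResolverSim, sim_init, sim_backend_request and sim_resolver_cycle are hypothetical names, the shared-memory queue is reduced to a counter, and the idle timeout is counted in resolver cycles rather than in the units of the GUC.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the resolver's shared-memory state. */
typedef struct ResolverSim
{
	int		queued;			/* entries waiting to be resolved */
	int		resolved;		/* entries resolved so far */
	int		idle_cycles;	/* consecutive cycles with an empty queue */
	int		timeout_cycles;	/* assumed stand-in for the timeout GUC */
	bool	running;		/* is a resolver process alive? */
} ResolverSim;

void
sim_init(ResolverSim *r, int timeout_cycles)
{
	assert(timeout_cycles > 0);
	r->queued = 0;
	r->resolved = 0;
	r->idle_cycles = 0;
	r->timeout_cycles = timeout_cycles;
	r->running = false;
}

/*
 * Backend side: add one entry to the queue, then either launch a
 * resolver or wake the existing one.  Returns true only if a new
 * resolver had to be launched.
 */
bool
sim_backend_request(ResolverSim *r)
{
	r->queued++;
	if (r->running)
		return false;		/* already working: just wake it up */
	r->running = true;
	r->idle_cycles = 0;
	return true;
}

/*
 * Resolver side: one cycle.  Drain the queue if there is work,
 * otherwise count an idle cycle and exit once the timeout is reached.
 */
void
sim_resolver_cycle(ResolverSim *r)
{
	if (!r->running)
		return;
	if (r->queued > 0)
	{
		r->resolved += r->queued;
		r->queued = 0;
		r->idle_cycles = 0;
		return;
	}
	if (++r->idle_cycles >= r->timeout_cycles)
		r->running = false;
}
```

In this sketch at most one launch happens per busy period: later requests only enqueue, and a fresh launch is needed only after the resolver has timed out.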

Processing Sequence
=================
I've changed the processing sequence for resolving foreign transactions
so that the second phase of the two-phase commit protocol (COMMIT/ROLLBACK
PREPARED) is executed by a resolver process, not by the backend process.
The basic processing sequence is as follows:

* Backend process
1. In pre-commit phase, the backend process saves fdwxact entries, and
then prepares transaction on all foreign servers that can execute
two-phase commit protocol.
2. Local commit.
3. Enqueue itself to the shmem queue and change its status to WAITING
4. launch or wakeup a resolver process and wait

* Resolver process
1. Dequeue the waiting process from shmem queue
2. Collect the fdwxact entries that are associated with the waiting process.
3. Resolve foreign transactions
4. Release the waiting process

Why do we want the backend to linger behind once it has added its
foreign transaction entries to shared memory and informed the resolver
about them? The foreign connections may take their own time, and even
after that there is no guarantee that the foreign transactions will be
resolved if the foreign server is not available. So why make
the backend wait?

5. Wake up and restart

This is still under the design phase and I'm sure that there is room
for improvement and consider more sensitive behaviour but I'd like to
share the current status of the patch. The patch includes regression
tests but not includes fully documentation.

Any background worker or backend should be a child of the postmaster, so
we should not let a backend start a resolver process. That should be the
job of the postmaster.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#158

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Ashutosh Bapat (#157)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 26, 2017 at 2:36 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Wed, Oct 25, 2017 at 3:15 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Foreign Transaction Resolver
======================
I introduced a new background worker called "foreign transaction
resolver" which is responsible for resolving the transaction prepared
on foreign servers. The foreign transaction resolver process is
launched by backend processes when commit/rollback transaction. And it
periodically resolves the queued transactions on a database as long as
the queue is not empty. If the queue has been empty for the certain
time specified by foreign_transaction_resolver_time GUC parameter, it
exits. It means that the backend doesn't launch a new resolver process
if the resolver process is already working. In this case, the backend
process just adds the entry to the queue on shared memory and wake it
up. The maximum number of resolver process we can launch is controlled
by max_foreign_transaction_resolvers. So we recommends to set larger
max_foreign_transaction_resolvers value than the number of databases.
The resolver process also tries to resolve dangling transaction as
well in a cycle.

Processing Sequence
=================
I've changed the processing sequence of resolving foreign transaction
so that the second phase of two-phase commit protocol (COMMIT/ROLLBACK
prepared) is executed by a resolver process, not by backend process.
The basic processing sequence is following;

* Backend process
1. In pre-commit phase, the backend process saves fdwxact entries, and
then prepares transaction on all foreign servers that can execute
two-phase commit protocol.
2. Local commit.
3. Enqueue itself to the shmem queue and change its status to WAITING
4. launch or wakeup a resolver process and wait

* Resolver process
1. Dequeue the waiting process from shmem queue
2. Collect the fdwxact entries that are associated with the waiting process.
3. Resolve foreign transactions
4. Release the waiting process

Why do we want the the backend to linger behind, once it has added its
foreign transaction entries in the shared memory and informed resolver
about it? The foreign connections may take their own time and even
after that there is no guarantee that the foreign transactions will be
resolved in case the foreign server is not available. So, why to make
the backend wait?

Because I don't want to break the current user semantics. That is,
currently it's guaranteed that subsequent reads can see the
committed result of previous writes, even if the previous transactions
were distributed transactions. And that is ensured by the writer side. If we
can make the reader side ensure it, the backend process doesn't need to
wait for the resolver process.

The waiting backend processes are released by the resolver process after
the resolver process has tried to resolve the foreign transactions. Even if
the resolver process fails either to connect to the foreign server or to
resolve the foreign transaction, the backend process will be released, and
the foreign transactions are left as dangling transactions in that
case, to be processed later. Also, if the resolver process takes a long
time to resolve foreign transactions for whatever reason, the user can
cancel the wait with Ctrl-C at any time.
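
As an illustration only, the release-after-one-attempt behaviour described above can be modelled in plain C. This is a single-process sketch, not code from the patch: the types FdwXact and Waiter, the `reachable` flag, and the functions backend_enqueue() and resolver_process() are all hypothetical names invented here, and real shared-memory queues, latches, and libpq connections are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FDWXACTS 8

/* Hypothetical states; the patch's real names may differ. */
typedef enum { FDWXACT_PREPARED, FDWXACT_RESOLVED } FdwXactState;

typedef struct {
    int          server_id;  /* foreign server holding the prepared xact */
    bool         reachable;  /* stand-in for "the resolver can connect now" */
    FdwXactState state;
} FdwXact;

typedef struct {
    FdwXact xacts[MAX_FDWXACTS];
    size_t  nxacts;
    bool    waiting;         /* backend blocked until the resolver releases it */
} Waiter;

/* Backend side: after the local commit, mark itself as waiting. */
void backend_enqueue(Waiter *w)
{
    assert(w->nxacts <= MAX_FDWXACTS);
    w->waiting = true;
}

/*
 * Resolver side: try every unresolved entry once, then release the waiter
 * whether or not all entries were resolved. Returns the number of entries
 * left dangling for a later cycle.
 */
size_t resolver_process(Waiter *w)
{
    size_t dangling = 0;

    for (size_t i = 0; i < w->nxacts; i++) {
        FdwXact *x = &w->xacts[i];

        if (x->state == FDWXACT_RESOLVED)
            continue;
        if (x->reachable)
            x->state = FDWXACT_RESOLVED;  /* COMMIT/ROLLBACK PREPARED went through */
        else
            dangling++;                   /* stays prepared on the foreign server */
    }
    w->waiting = false;                   /* released even if something is dangling */
    return dangling;
}
```

Calling resolver_process() again once the unreachable server is back resolves the dangling entry, which corresponds to the "processed later" part of the description.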

5. Wake up and restart

This is still under the design phase and I'm sure that there is room
for improvement and consider more sensitive behaviour but I'd like to
share the current status of the patch. The patch includes regression
tests but not includes fully documentation.

Any background worker, backend should be child of the postmaster, so
we should not let a backend start a resolver process. It should be the
job of the postmaster.

Of course I won't. I used the phrase "the backend process launches
the resolver process" to make the explanation easier. Sorry for confusing you.
The backend process calls the RegisterDynamicBackgroundWorker() function
to launch a resolver process, so resolvers are launched by the postmaster.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#159

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Masahiko Sawada (#158)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 26, 2017 at 4:11 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Why do we want the the backend to linger behind, once it has added its
foreign transaction entries in the shared memory and informed resolver
about it? The foreign connections may take their own time and even
after that there is no guarantee that the foreign transactions will be
resolved in case the foreign server is not available. So, why to make
the backend wait?

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions.

Right, this is very important, and having the backend wait for the
resolver(s) is, I think, the right way to implement it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#160

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 8 years ago

In reply to: Masahiko Sawada (#158)

Re: Transactions involving multiple postgres foreign servers

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for the resolver to
resolve the foreign transaction, we will have to add complexity to make
sure that the waiting backends are woken up in problematic events like
a crash of the resolver process, or the resolver process hanging in a
connection to a foreign server, etc. I am not sure that the complexity
is worth the half-guarantee.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


#161

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Ashutosh Bapat (#160)

Re: Transactions involving multiple postgres foreign servers

On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

Hmm, maybe I was wrong. I now think that the waiting backends can be
woken up only in the following cases:
- The resolver process succeeded in resolving all foreign transactions.
- The user cancelled the wait (e.g. Ctrl-C).
- The resolver process failed to resolve a foreign transaction because
there is no such prepared transaction on the foreign server.

In other cases the resolver process should not release the waiters.
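
To make the rule above concrete, here is a minimal sketch of it as a single decision function. The enum ResolveOutcome and the function should_release_waiter() are hypothetical names made up for this illustration, and folding the user cancel into the same enum is a simplification; nothing here is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of one resolution attempt for a waiting backend. */
typedef enum {
    RESOLVE_OK,                 /* all foreign transactions resolved */
    RESOLVE_USER_CANCEL,        /* user cancelled the wait, e.g. Ctrl-C */
    RESOLVE_NO_SUCH_XACT,       /* prepared xact does not exist on the server */
    RESOLVE_CONNECTION_FAILED,  /* could not reach the foreign server */
    RESOLVE_OTHER_ERROR         /* any other failure */
} ResolveOutcome;

/* Return true only for the three cases in which the waiter may be woken up. */
bool should_release_waiter(ResolveOutcome outcome)
{
    switch (outcome) {
    case RESOLVE_OK:
    case RESOLVE_USER_CANCEL:
    case RESOLVE_NO_SUCH_XACT:
        return true;
    case RESOLVE_CONNECTION_FAILED:
    case RESOLVE_OTHER_ERROR:
        return false;           /* keep waiting; the resolver retries later */
    }
    assert(0 && "unhandled ResolveOutcome");
    return false;
}
```

Under this rule a connection failure keeps the backend waiting, which is the opposite of the earlier proposal in which the backend was released after a single failed attempt.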

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


#162

Antonin Houska

ah@cybertec.at

about 8 years ago

In reply to: Masahiko Sawada (#161)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

Hmm, maybe I was wrong. I now think that the waiting backends can be
woken up only in following cases;
- The resolver process succeeded to resolve all foreign transactions.
- The user did the cancel (e.g. ctl-c).
- The resolver process failed to resolve foreign transaction for a
reason of there is no such prepared transaction on foreign server.

In other cases the resolver process should not release the waiters.

I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
effort is needed to ensure that the backend does not keep waiting if the resolver
can't finish its work in the foreseeable future. But this effort is not worthwhile,
because by waking the backend up you might prevent the next transaction from
seeing the changes the previous one tried to make."

On the other hand, your last comments indicate that you try to be even more
stringent in letting the backend wait. However even this stringent approach
does not guarantee that the next transaction will see the data changes made by
the previous one.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#163

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Antonin Houska (#162)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote:

Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

Hmm, maybe I was wrong. I now think that the waiting backends can be
woken up only in following cases;
- The resolver process succeeded to resolve all foreign transactions.
- The user did the cancel (e.g. ctl-c).
- The resolver process failed to resolve foreign transaction for a
reason of there is no such prepared transaction on foreign server.

In other cases the resolver process should not release the waiters.

I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
effort is needed to ensure that backend does not keep waiting if the resolver
can't finish it's work in forseable future. But this effort is not worth
because by waking the backend up you might prevent the next transaction from
seeing the changes the previous one tried to make."

On the other hand, your last comments indicate that you try to be even more
stringent in letting the backend wait. However even this stringent approach
does not guarantee that the next transaction will see the data changes made by
the previous one.

What I'd like to guarantee is that a subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by the user. In
other words, the backend should not be woken up, and the resolver
should keep trying to resolve at certain intervals, even if the resolver
fails to connect to the foreign server or fails to resolve the transaction.
This is similar to what synchronous replication guarantees today. Keeping
these semantics is very important for users. Note that reading a
consistent result from concurrent reads is a separate problem.

The read result including foreign servers can be inconsistent if
such a transaction is cancelled or the coordinator server crashes during
two-phase commit processing. That is, if there is an in-doubt transaction,
the read result can be inconsistent, even for subsequent reads. But I
think this behaviour can be accepted by users. As for the resolution of
in-doubt transactions, the resolver process will try to resolve such
transactions after the coordinator server has recovered. On the other
hand, to let subsequent reads get a consistent result in such a situation,
we can, for example, disallow backends from sending SQL
to a foreign server if an unresolved foreign transaction remains
on that server.

For concurrent reads, reading an inconsistent result can
happen even without an in-doubt transaction, because we can read data on a
foreign server between PREPARE and COMMIT PREPARED while other foreign
servers have already committed. I think we should deal with this problem with
another feature on the reader side, for example atomic visibility. If we
have an atomic visibility feature, we can also solve the above problem.
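
The window between PREPARE and COMMIT PREPARED can be shown with a toy model. In this sketch each foreign server holds a single integer, a transfer moves an amount from server A to server B, and a reader sums both values after A has committed but before B has. The Server struct and all function names are hypothetical; this models the visibility problem only and is not how postgres_fdw or the patch is implemented.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy foreign server: one committed value plus one staged (prepared) value. */
typedef struct {
    int  committed;  /* what readers on this server currently see */
    int  pending;    /* value staged by PREPARE TRANSACTION */
    bool prepared;
} Server;

void prepare(Server *s, int new_value)
{
    assert(!s->prepared);
    s->pending = new_value;
    s->prepared = true;
}

void commit_prepared(Server *s)
{
    assert(s->prepared);
    s->committed = s->pending;
    s->prepared = false;
}

int read_value(const Server *s)
{
    return s->committed;  /* prepared but uncommitted changes are invisible */
}

/*
 * Move `amount` from a to b with two-phase commit and return the total that
 * a concurrent reader would observe between the two COMMIT PREPARED calls.
 */
int transfer_and_observe(Server *a, Server *b, int amount)
{
    prepare(a, a->committed - amount);
    prepare(b, b->committed + amount);

    commit_prepared(a);
    int observed = read_value(a) + read_value(b);  /* the concurrent read */
    commit_prepared(b);

    return observed;
}
```

With both servers starting at 500 and a transfer of 100, the reader in the middle sees a total of 900 although the total is 1000 both before and after the transaction. No failure is involved, which is why this needs a reader-side feature such as atomic visibility rather than a change to the resolver.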

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#164

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Masahiko Sawada (#163)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Mon, Nov 27, 2017 at 4:35 PM Masahiko Sawada <sawada.mshk@gmail.com>
wrote:

What I'd like to guarantee is that the subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by user. In
other words, the backend should not be waken up and the resolver
should continue to resolve at certain intervals even if the resolver
fails to connect to the foreign server or fails to resolve it. This is
similar to what synchronous replication guaranteed today. Keeping this
semantics is very important for users. Note that the reading a
consistent result by concurrent reads is a separated problem.

The read result including foreign servers can be inconsistent if the
such transaction is cancelled or the coordinator server crashes during
two-phase commit processing. That is, if there is in-doubt transaction
the read result can be inconsistent, even for subsequent reads. But I
think this behaviour can be accepted by users. For the resolution of
in-doubt transactions, the resolver process will try to resolve such
transactions after the coordinator server recovered. On the other
hand, for the reading a consistent result on such situation by
subsequent reads, for example, we can disallow backends to inquiry SQL
to the foreign server if a foreign transaction of the foreign server
is remained.

For the concurrent reads, the reading an inconsistent result can be
happen even without in-doubt transaction because we can read data on a
foreign server between PREPARE and COMMIT PREPARED while other foreign
servers have committed. I think we should deal with this problem by
other feature on reader side, for example, atomic visibility. If we
have atomic visibility feature, we also can solve the above problem.

+1 to all of that.

...Robert

--

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#165

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 8 years ago

In reply to: Masahiko Sawada (#163)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Tue, Nov 28, 2017 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote:

Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

Hmm, maybe I was wrong. I now think that the waiting backends can be
woken up only in following cases;
- The resolver process succeeded to resolve all foreign transactions.
- The user did the cancel (e.g. ctl-c).
- The resolver process failed to resolve foreign transaction for a
reason of there is no such prepared transaction on foreign server.

In other cases the resolver process should not release the waiters.

I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
effort is needed to ensure that backend does not keep waiting if the resolver
can't finish it's work in forseable future. But this effort is not worth
because by waking the backend up you might prevent the next transaction from
seeing the changes the previous one tried to make."

On the other hand, your last comments indicate that you try to be even more
stringent in letting the backend wait. However even this stringent approach
does not guarantee that the next transaction will see the data changes made by
the previous one.

What I'd like to guarantee is that the subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by user. In
other words, the backend should not be waken up and the resolver
should continue to resolve at certain intervals even if the resolver
fails to connect to the foreign server or fails to resolve it. This is
similar to what synchronous replication guaranteed today. Keeping this
semantics is very important for users. Note that the reading a
consistent result by concurrent reads is a separated problem.

The question I have is how we would deal with a foreign server that is
unavailable for a longer duration due to a crash, a longer network outage,
etc. An example is a foreign server that crashed or got disconnected after
PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain
blocked for a much longer duration without the user having any idea of what's
going on. Maybe we should add some timeout.

The read result including foreign servers can be inconsistent if the
such transaction is cancelled or the coordinator server crashes during
two-phase commit processing. That is, if there is in-doubt transaction
the read result can be inconsistent, even for subsequent reads. But I
think this behaviour can be accepted by users. For the resolution of
in-doubt transactions, the resolver process will try to resolve such
transactions after the coordinator server recovered. On the other
hand, for the reading a consistent result on such situation by
subsequent reads, for example, we can disallow backends to inquiry SQL
to the foreign server if a foreign transaction of the foreign server
is remained.

+1 for the last sentence. If we do that, we don't need the backend to
be blocked by resolver since a subsequent read accessing that foreign
server would get an error and not inconsistent data.
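
A rough sketch of that alternative, assuming a hypothetical in-memory list of foreign transaction entries: before sending SQL to a foreign server, the backend checks whether that server still has an unresolved entry and refuses the query if so. FdwXactEntry, server_has_unresolved_xact() and check_foreign_query_allowed() are names invented for this example, and the integer return code stands in for the error a real backend would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of a foreign transaction prepared on some server. */
typedef struct {
    int  server_id;
    bool resolved;   /* true once COMMIT/ROLLBACK PREPARED has succeeded */
} FdwXactEntry;

/* Does the given server still hold an unresolved foreign transaction? */
bool server_has_unresolved_xact(const FdwXactEntry *entries, size_t n,
                                int server_id)
{
    assert(entries != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (entries[i].server_id == server_id && !entries[i].resolved)
            return true;
    return false;
}

/*
 * Gate for sending SQL to a foreign server: returns 0 if the query may
 * proceed, -1 if it must be rejected because the server is in doubt.
 */
int check_foreign_query_allowed(const FdwXactEntry *entries, size_t n,
                                int server_id)
{
    return server_has_unresolved_xact(entries, n, server_id) ? -1 : 0;
}
```

The trade-off is that the writer no longer blocks at commit, but any later statement touching an in-doubt server fails until the resolver has caught up.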

For the concurrent reads, the reading an inconsistent result can be
happen even without in-doubt transaction because we can read data on a
foreign server between PREPARE and COMMIT PREPARED while other foreign
servers have committed. I think we should deal with this problem by
other feature on reader side, for example, atomic visibility. If we
have atomic visibility feature, we also can solve the above problem.

+1.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#166

Antonin Houska

ah@cybertec.at

about 8 years ago

In reply to: Masahiko Sawada (#163)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote:

Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

Hmm, maybe I was wrong. I now think that the waiting backends can be
woken up only in following cases;
- The resolver process succeeded to resolve all foreign transactions.
- The user did the cancel (e.g. ctl-c).
- The resolver process failed to resolve foreign transaction for a
reason of there is no such prepared transaction on foreign server.

In other cases the resolver process should not release the waiters.

I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
effort is needed to ensure that backend does not keep waiting if the resolver
can't finish it's work in forseable future. But this effort is not worth
because by waking the backend up you might prevent the next transaction from
seeing the changes the previous one tried to make."

On the other hand, your last comments indicate that you try to be even more
stringent in letting the backend wait. However even this stringent approach
does not guarantee that the next transaction will see the data changes made by
the previous one.

What I'd like to guarantee is that the subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by user.

I missed the point that user should not expect atomicity of the commit to be
guaranteed if he has cancelled his request.

The other things are clear to me, including the fact that atomic commit and
atomic visibility will be implemented separately. Thanks.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

#167

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Ashutosh Bapat (#165)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Tue, Nov 28, 2017 at 12:31 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Tue, Nov 28, 2017 at 3:04 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Nov 24, 2017 at 10:28 PM, Antonin Houska <ah@cybertec.at> wrote:

Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Mon, Oct 30, 2017 at 5:48 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Oct 26, 2017 at 7:41 PM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Because I don't want to break the current user semantics. that is,
currently it's guaranteed that the subsequent reads can see the
committed result of previous writes even if the previous transactions
were distributed transactions. And it's ensured by writer side. If we
can make the reader side ensure it, the backend process don't need to
wait for the resolver process.

The waiting backend process are released by resolver process after the
resolver process tried to resolve foreign transactions. Even if
resolver process failed to either connect to foreign server or to
resolve foreign transaction the backend process will be released and
the foreign transactions are leaved as dangling transaction in that
case, which are processed later. Also if resolver process takes a long
time to resolve foreign transactions for whatever reason the user can
cancel it by Ctl-c anytime.

So, there's no guarantee that the next command issued from the
connection *will* see the committed data, since the foreign
transaction might not have committed because of a network glitch
(say). If we go this route of making backends wait for resolver to
resolve the foreign transaction, we will have to add complexity to make
sure that the waiting backends are woken up in problematic events like
crash of the resolver process OR if the resolver process hangs in a
connection to a foreign server etc. I am not sure that the complexity
is worth the half-guarantee.

Hmm, maybe I was wrong. I now think that the waiting backends can be
woken up only in the following cases:
- The resolver process succeeded in resolving all foreign transactions.
- The user cancelled (e.g. Ctrl-C).
- The resolver process failed to resolve a foreign transaction because
there is no such prepared transaction on the foreign server.

In other cases the resolver process should not release the waiters.
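
The release rule listed above can be sketched as a small C predicate. This is only an illustration: the enum `ResolveOutcome`, its members, and the function `should_release_waiter` are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of one resolution attempt (not names from the patch). */
typedef enum ResolveOutcome
{
    RESOLVE_OK,             /* all foreign transactions were resolved */
    RESOLVE_NO_SUCH_XACT,   /* foreign server has no such prepared transaction */
    RESOLVE_CONNECT_FAILED, /* could not connect to the foreign server */
    RESOLVE_OTHER_ERROR     /* any other failure */
} ResolveOutcome;

/*
 * Return true if the waiting backend may be released, following the
 * three cases listed in the message; every other case keeps it waiting.
 */
bool
should_release_waiter(ResolveOutcome outcome, bool user_cancelled)
{
    assert(outcome >= RESOLVE_OK && outcome <= RESOLVE_OTHER_ERROR);

    if (user_cancelled)
        return true;        /* the user cancelled, e.g. with Ctrl-C */

    switch (outcome)
    {
        case RESOLVE_OK:
        case RESOLVE_NO_SUCH_XACT:
            return true;
        case RESOLVE_CONNECT_FAILED:
        case RESOLVE_OTHER_ERROR:
            return false;   /* keep waiting; the resolver retries later */
    }
    return false;
}
```

Under this rule a connection failure alone never wakes the backend; only a cancel does.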

I'm not sure I see consensus here. What Ashutosh says seems to be: "Special
effort is needed to ensure that the backend does not keep waiting if the resolver
can't finish its work in the foreseeable future. But this effort is not worth it
because by waking the backend up you might prevent the next transaction from
seeing the changes the previous one tried to make."

On the other hand, your last comments indicate that you try to be even more
stringent in letting the backend wait. However even this stringent approach
does not guarantee that the next transaction will see the data changes made by
the previous one.

What I'd like to guarantee is that a subsequent read can see the
committed result of previous writes if the transaction involving
multiple foreign servers is committed without cancellation by the user. In
other words, the backend should not be woken up and the resolver
should keep trying to resolve at certain intervals even if it
fails to connect to the foreign server or fails to resolve the transaction. This is
similar to what synchronous replication guarantees today. Keeping these
semantics is very important for users. Note that reading a
consistent result with concurrent reads is a separate problem.

The question I have is how we would deal with a foreign server that is
not available for a longer duration due to a crash, a longer network outage,
etc. An example is a foreign server that crashed or got disconnected after
PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain
blocked for a much longer duration without the user having any idea of what's
going on. Maybe we should add some timeout.

After more thought, I agree with adding some timeout. I can imagine
there are users who want the timeout, for example, those who cannot accept
even a few seconds of latency. If the timeout occurs, the backend unlocks the
foreign transactions and breaks out of the loop. The resolver process will
continue to resolve foreign transactions at certain intervals.

The read result including foreign servers can be inconsistent if
such a transaction is cancelled or the coordinator server crashes during
two-phase commit processing. That is, if there is an in-doubt transaction,
the read result can be inconsistent, even for subsequent reads. But I
think this behaviour can be accepted by users. As for the resolution of
in-doubt transactions, the resolver process will try to resolve such
transactions after the coordinator server has recovered. On the other
hand, to let subsequent reads get a consistent result in such a situation,
we can, for example, disallow backends from issuing SQL
to a foreign server if a foreign transaction on that foreign server
remains unresolved.

+1 for the last sentence. If we do that, we don't need the backend to
be blocked by resolver since a subsequent read accessing that foreign
server would get an error and not inconsistent data.

Yeah, however the disadvantage of this is that we manage foreign
transactions per foreign server. If a transaction that modified even
one table remains as an in-doubt transaction, we cannot issue any
SQL that touches that foreign server. Can we raise an error at
ExecInitForeignScan()?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#168

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Masahiko Sawada (#167)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Mon, Dec 11, 2017 at 5:20 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The question I have is how we would deal with a foreign server that is
not available for a longer duration due to a crash, a longer network outage,
etc. An example is a foreign server that crashed or got disconnected after
PREPARE but before COMMIT/ROLLBACK was issued. The backend will remain
blocked for a much longer duration without the user having any idea of what's
going on. Maybe we should add some timeout.

After more thought, I agree with adding some timeout. I can imagine
there are users who want the timeout, for example, those who cannot accept
even a few seconds of latency. If the timeout occurs, the backend unlocks the
foreign transactions and breaks out of the loop. The resolver process will
continue to resolve foreign transactions at certain intervals.

I don't think a timeout is a very good idea. There is no timeout for
synchronous replication and the issues here are similar. I will not
try to block a patch adding a timeout, but I think it had better be
disabled by default and have very clear documentation explaining why
it's really dangerous. And this is why: with no timeout, you can
count on being able to see the effects of your own previous
transactions, unless at some point you sent a query cancel or got
disconnected. With a timeout, you may or may not see the effects of
your own previous transactions depending on whether or not you hit the
timeout, which you have no sure way of knowing.
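
The difference can be sketched as a simulated wait loop. This is a toy model under stated assumptions: time is counted in abstract ticks, and `WaitResult` and `wait_for_resolution` are hypothetical names, not code from the patch.

```c
#include <assert.h>

/* Hypothetical reasons for a backend to stop waiting (not patch names). */
typedef enum WaitResult
{
    WAIT_RESOLVED,  /* foreign transactions resolved; later reads see the writes */
    WAIT_CANCELLED, /* the user sent a cancel; they know the outcome is in doubt */
    WAIT_TIMED_OUT  /* timeout fired; the client cannot tell this happened */
} WaitResult;

/*
 * Simulate the backend's wait.  A negative resolved_at_tick or
 * cancel_at_tick means "never"; timeout_ticks == 0 disables the timeout.
 */
WaitResult
wait_for_resolution(int resolved_at_tick, int cancel_at_tick, int timeout_ticks)
{
    /* With nothing to end the wait, the backend would block forever. */
    assert(resolved_at_tick >= 0 || cancel_at_tick >= 0 || timeout_ticks > 0);

    for (int tick = 0;; tick++)
    {
        if (resolved_at_tick >= 0 && tick >= resolved_at_tick)
            return WAIT_RESOLVED;
        if (cancel_at_tick >= 0 && tick >= cancel_at_tick)
            return WAIT_CANCELLED;
        if (timeout_ticks > 0 && tick >= timeout_ticks)
            return WAIT_TIMED_OUT;
    }
}
```

With the timeout disabled the only exits are resolution or an explicit cancel, which is the property the paragraph above relies on; enabling it adds a third exit that the client cannot observe.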

transactions after the coordinator server has recovered. On the other
hand, to let subsequent reads get a consistent result in such a situation,
we can, for example, disallow backends from issuing SQL
to a foreign server if a foreign transaction on that foreign server
remains unresolved.

+1 for the last sentence. If we do that, we don't need the backend to
be blocked by resolver since a subsequent read accessing that foreign
server would get an error and not inconsistent data.

Yeah, however the disadvantage of this is that we manage foreign
transactions per foreign server. If a transaction that modified even
one table remains as an in-doubt transaction, we cannot issue any
SQL that touches that foreign server. Can we raise an error at
ExecInitForeignScan()?

I really feel strongly we shouldn't complicate the initial patch with
this kind of thing. Let's make it enough for this patch to guarantee
that either all parts of the transaction commit eventually or they all
abort eventually. Ensuring consistent visibility is a different and
hard project, and if we try to do that now, this patch is not going to
be done any time soon.
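
A minimal sketch of that narrower guarantee, using a toy in-memory model rather than anything from the patch: `Participant`, `prepare_all` and `resolve_pass` are hypothetical names, and "reachable" stands in for whatever makes a foreign server temporarily unavailable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum PartState
{
    PART_ACTIVE, PART_PREPARED, PART_COMMITTED, PART_ABORTED
} PartState;

typedef struct Participant
{
    PartState state;
    bool      can_prepare; /* does PREPARE succeed on this server? */
    bool      reachable;   /* can the server be reached in phase two? */
} Participant;

/* Phase one: prepare everywhere.  Returns the global decision (true = commit). */
bool
prepare_all(Participant *p, size_t n)
{
    bool commit = true;

    for (size_t i = 0; i < n; i++)
    {
        assert(p[i].state == PART_ACTIVE);
        if (p[i].can_prepare)
            p[i].state = PART_PREPARED;
        else
        {
            p[i].state = PART_ABORTED;
            commit = false; /* one failure forces a global abort */
        }
    }
    return commit;
}

/*
 * Phase two, also usable as a later resolver retry: apply the same decision
 * to every reachable prepared participant.  Returns how many are still pending.
 */
size_t
resolve_pass(Participant *p, size_t n, bool commit)
{
    size_t pending = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (p[i].state != PART_PREPARED)
            continue;
        if (!p[i].reachable)
        {
            pending++;      /* stays prepared until a later pass */
            continue;
        }
        p[i].state = commit ? PART_COMMITTED : PART_ABORTED;
    }
    return pending;
}
```

Because the decision is fixed after phase one and every later pass applies the same decision, all participants end in the same final state once they become reachable. Nothing here says when a reader sees that state, which is the visibility problem left out of scope.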

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#169

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Robert Haas (#168)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

Thank you for the suggestion.

I was really wondering if we should add a timeout to this feature.
It's a common concern that we want to put a timeout on a critical
section. But currently we don't have such a timeout for either
synchronous replication or writing WAL. I can imagine there will be
users who want a timeout for such cases, but obviously it makes this
feature more complex. Anyway, even if we add a timeout to this feature,
we can make it a separate patch and feature. So I'd like to keep
it simple as a first step. This patch guarantees that the transaction
is either committed or rolled back on all foreign servers, unless the user
cancels.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#170

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Masahiko Sawada (#169)

3 attachment(s)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

I've updated documentation of patches, and fixed some bugs. I did some
failure tests of this feature using a fault simulation tool [1] for
PostgreSQL that I created.

The 0001 patch adds a mechanism to keep track of writes on the local server. This
is required to determine whether we should use 2PC at commit. The 0002
patch is the main part. It adds a distributed transaction manager
(currently only for atomic commit), APIs for 2PC, and a foreign
transaction resolver process. The 0003 patch makes postgres_fdw
support atomic commit using 2PC.
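
To illustrate why the local-write flag matters, here is a rough sketch of a commit-time decision. The rule shown (use 2PC when at least two nodes, counting the local one, were written) is an assumption for illustration, and `two_phase_commit_needed` is a hypothetical function, not one from the patches.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether two-phase commit is worthwhile.  Assumed rule: with a
 * single writer (one foreign server, or only the local node) an ordinary
 * commit is already atomic, so 2PC is needed only for two or more writers.
 */
bool
two_phase_commit_needed(int modified_foreign_servers, bool local_did_write)
{
    assert(modified_foreign_servers >= 0);

    int writers = modified_foreign_servers + (local_did_write ? 1 : 0);

    return writers >= 2;
}
```

In this model the flag tracked by 0001 plays the role of `local_did_write`: one modified foreign server alone does not need 2PC, but it does once the local node has also written.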

Please review patches.

[1]: https://github.com/MasahikoSawada/pg_simula

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

0001-Keep-track-of-local-writes_v14.patch (application/octet-stream)

From ebe5867d1773ae49f0e436639d9777a63a7121a2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sat, 21 Oct 2017 15:50:22 +0900
Subject: [PATCH 1/3] Keep track of local writes.

---
 src/backend/access/heap/heapam.c  |    9 +++++++++
 src/backend/access/transam/xact.c |   27 +++++++++++++++++++++++++++
 src/include/access/xact.h         |    4 ++++
 3 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 54f1100..0e28e1d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2453,6 +2453,9 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
 
+	/* Remember to write on local node */
+	RememberTransactionDidWrite();
+
 	/* NO EREPORT(ERROR) from here till changes are logged */
 	START_CRIT_SECTION();
 
@@ -3262,6 +3265,9 @@ l1:
 	 */
 	MultiXactIdSetOldestMember();
 
+	/* Remember to write on local node */
+	RememberTransactionDidWrite();
+
 	compute_new_xmax_infomask(HeapTupleHeaderGetRawXmax(tp.t_data),
 							  tp.t_data->t_infomask, tp.t_data->t_infomask2,
 							  xid, LockTupleExclusive, true,
@@ -4167,6 +4173,9 @@ l2:
 	 */
 	CheckForSerializableConflictIn(relation, &oldtup, buffer);
 
+	/* Remember to write on local node */
+	RememberTransactionDidWrite();
+
 	/*
 	 * At this point newbuf and buffer are both pinned and locked, and newbuf
 	 * has enough space for the new tuple.  If they are the same buffer, only
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b37510c..e795a2f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -117,6 +117,9 @@ TransactionId *ParallelCurrentXids;
  */
 int			MyXactFlags;
 
+/* True means transaction did any writes */
+bool		TransactionDidWrite = false;
+
 /*
  *	transaction states - transaction state from server perspective
  */
@@ -2151,6 +2154,8 @@ CommitTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	ForgetTransactionDidWrite();
+
 	/*
 	 * done with commit processing, set current transaction state back to
 	 * default
@@ -2428,6 +2433,8 @@ PrepareTransaction(void)
 	XactTopTransactionId = InvalidTransactionId;
 	nParallelCurrentXids = 0;
 
+	ForgetTransactionDidWrite();
+
 	/*
 	 * done with 1st phase commit processing, set current transaction state
 	 * back to default
@@ -2612,6 +2619,8 @@ AbortTransaction(void)
 		pgstat_report_xact_timestamp(0);
 	}
 
+	ForgetTransactionDidWrite();
+
 	/*
 	 * State remains TRANS_ABORT until CleanupTransaction().
 	 */
@@ -4442,6 +4451,24 @@ AbortOutOfAnyTransaction(void)
 }
 
 /*
+ * RememberTransactionDidWrite --- remember a transaction did any writes
+ */
+void
+RememberTransactionDidWrite(void)
+{
+	TransactionDidWrite = true;
+}
+
+/*
+ * ForgetTransactionDidWrite --- Forget a transaction did any writes
+ */
+void
+ForgetTransactionDidWrite(void)
+{
+	TransactionDidWrite = false;
+}
+
+/*
  * IsTransactionBlock --- are we within a transaction block?
  */
 bool
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 118b0a8..63491de 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -91,6 +91,8 @@ extern int	MyXactFlags;
  */
 #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK	(1U << 1)
 
+/* True means transaction did any writes */
+extern bool TransactionDidWrite;
 
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
@@ -377,6 +379,8 @@ extern void RegisterXactCallback(XactCallback callback, void *arg);
 extern void UnregisterXactCallback(XactCallback callback, void *arg);
 extern void RegisterSubXactCallback(SubXactCallback callback, void *arg);
 extern void UnregisterSubXactCallback(SubXactCallback callback, void *arg);
+extern void RememberTransactionDidWrite(void);
+extern void ForgetTransactionDidWrite(void);
 
 extern int	xactGetCommittedChildren(TransactionId **ptr);
 
-- 
1.7.1

0002-Support-atomic-commit-involving-multiple-foreign-ser_v14.patch (application/octet-stream)

From 6643ac7c43161726bc78bbbadd782cd0f7159c70 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sat, 21 Oct 2017 16:05:35 +0900
Subject: [PATCH 2/3] Support atomic commit involving multiple foreign servers.

---
 doc/src/sgml/catalogs.sgml                    |   97 +
 doc/src/sgml/config.sgml                      |   89 +
 doc/src/sgml/fdwhandler.sgml                  |  142 ++
 doc/src/sgml/func.sgml                        |   47 +
 doc/src/sgml/monitoring.sgml                  |   44 +
 src/backend/access/rmgrdesc/Makefile          |    8 +-
 src/backend/access/rmgrdesc/fdwxactdesc.c     |   68 +
 src/backend/access/rmgrdesc/xlogdesc.c        |    6 +-
 src/backend/access/transam/Makefile           |    6 +-
 src/backend/access/transam/fdwxact.c          | 2448 +++++++++++++++++++++++++
 src/backend/access/transam/fdwxact_resolver.c |  522 ++++++
 src/backend/access/transam/rmgr.c             |    1 +
 src/backend/access/transam/twophase.c         |   42 +
 src/backend/access/transam/xact.c             |   26 +-
 src/backend/access/transam/xlog.c             |   19 +-
 src/backend/catalog/system_views.sql          |   11 +
 src/backend/commands/foreigncmds.c            |   20 +
 src/backend/postmaster/bgworker.c             |    4 +
 src/backend/postmaster/pgstat.c               |    6 +
 src/backend/postmaster/postmaster.c           |    5 +
 src/backend/replication/logical/decode.c      |    1 +
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/lmgr/lwlocknames.txt      |    2 +
 src/backend/storage/lmgr/proc.c               |    5 +
 src/backend/utils/misc/guc.c                  |   46 +
 src/backend/utils/misc/postgresql.conf.sample |    2 +
 src/backend/utils/probes.d                    |    2 +
 src/bin/initdb/initdb.c                       |    1 +
 src/bin/pg_controldata/pg_controldata.c       |    2 +
 src/bin/pg_resetwal/pg_resetwal.c             |    2 +
 src/bin/pg_waldump/rmgrdesc.c                 |    1 +
 src/include/access/fdwxact.h                  |  153 ++
 src/include/access/fdwxact_resolver.h         |   27 +
 src/include/access/resolver_private.h         |   61 +
 src/include/access/rmgrlist.h                 |    1 +
 src/include/access/twophase.h                 |    1 +
 src/include/access/xlog_internal.h            |    1 +
 src/include/catalog/pg_control.h              |    1 +
 src/include/catalog/pg_proc.h                 |   11 +
 src/include/foreign/fdwapi.h                  |   18 +
 src/include/pgstat.h                          |    4 +-
 src/include/storage/proc.h                    |   10 +
 src/test/recovery/Makefile                    |    2 +-
 src/test/recovery/t/014_fdwxact.pl            |  176 ++
 src/test/regress/expected/rules.out           |   12 +
 src/test/regress/pg_regress.c                 |   13 +-
 46 files changed, 4153 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/fdwxactdesc.c
 create mode 100755 src/backend/access/transam/fdwxact.c
 create mode 100644 src/backend/access/transam/fdwxact_resolver.c
 create mode 100644 src/include/access/fdwxact.h
 create mode 100644 src/include/access/fdwxact_resolver.h
 create mode 100644 src/include/access/resolver_private.h
 create mode 100644 src/test/recovery/t/014_fdwxact.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 3f02202..b6454e0 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9535,6 +9535,103 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
 
  </sect1>
 
+ <sect1 id="view-pg-prepared-fdw-xacts">
+  <title><structname>pg_prepared_fdw_xacts</structname></title>
+
+  <indexterm zone="view-pg-prepared-fdw-xacts">
+   <primary>pg_prepared_fdw_xacts</primary>
+  </indexterm>
+
+  <para>
+   The view <structname>pg_prepared_fdw_xacts</structname> displays
+   information about foreign transactions that are currently prepared on
+   foreign servers for atomic distributed transaction commit (see
+   <xref linkend="fdw-transactions"/> for details).
+  </para>
+
+  <para>
+   <structname>pg_prepared_fdw_xacts</structname> contains one row per prepared
+   foreign transaction.  An entry is removed when the foreign transaction is
+   committed or rolled back.
+  </para>
+
+  <table>
+   <title><structname>pg_prepared_fdw_xacts</structname> Columns</title>
+
+   <tgroup cols="4">
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>References</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry><structfield>dbid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-database"><structname>pg_database</structname></link>.oid</literal></entry>
+      <entry>
+       OID of the database which the foreign transaction resides in
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>transaction</structfield></entry>
+      <entry><type>xid</type></entry>
+      <entry></entry>
+      <entry>
+       Transaction ID that this foreign transaction is associated with
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>serverid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-foreign-server"><structname>pg_foreign_server</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the foreign server on which this foreign transaction is prepared
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>userid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="view-pg-user"><structname>pg_user</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the user that prepared this foreign transaction.
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>status</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       Status of foreign transaction: <literal>prepared</literal>, <literal>committing</literal>, <literal>aborting</literal> or <literal>unknown</literal>
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>identifier</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       The identifier of the prepared foreign transaction.
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   When the <structname>pg_prepared_fdw_xacts</structname> view is accessed, the
+   internal transaction manager data structures are momentarily locked, and
+   a copy is made for the view to display.  This ensures that the
+   view produces a consistent set of results, while not blocking
+   normal operations longer than necessary.  Nonetheless
+   there could be some impact on database performance if this view is
+   frequently accessed.
+  </para>
+
+ </sect1>
+
  <sect1 id="view-pg-publication-tables">
   <title><structname>pg_publication_tables</structname></title>
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e4a0169..753daac 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1451,6 +1451,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously. This parameter can only be set at server start.
+       </para>
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
@@ -3467,6 +3486,76 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
      </variablelist>
     </sect2>
 
+    <sect2 id="runtime-config-foregin-transaction-resolver">
+     <title>Foreign Transaction Resolvers</title>
+
+     <para>
+      These settings control the behavior of a foreign transaction resolver.
+     </para>
+
+     <variablelist>
+
+     <varlistentry id="guc-max-foreign-transaction-resolvers" xreflabel="max_foreign_transaction_resolvers">
+      <term><varname>max_foreign_transaction_resolvers</varname> (<type>int</type>)
+      <indexterm>
+       <primary><varname>max_foreign_transaction_resolvers</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum number of foreign transaction resolution workers. Each foreign transaction
+        resolution worker is responsible for one database.
+       </para>
+       <para>
+        Foreign transaction resolution workers are taken from the pool defined by
+        <varname>max_worker_processes</varname>.
+       </para>
+       <para>
+        The default value is 0.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="foreign-transaction-resolution-interval" xreflabel="foreign_transaction_resolution_interval">
+      <term><varname>foreign_transaction_resolution_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>foreign_transaction_resolution_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specify how long the foreign transaction resolver should wait before trying to resolve
+        foreign transaction. This parameter can only be set in the
+        <filename>postgresql.conf</filename> file or on the server command line.
+       </para>
+       <para>
+        The default value is 10 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="foreign-transaction-resolver-timeout" xreflabel="foreign_transaction_resolver_timeout">
+      <term><varname>foreign_transaction_resolver_timeout</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>foreign_transaction_resolver_timeout</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Terminate foreign transaction resolver processes that have had no foreign
+        transactions to resolve for longer than the specified number of milliseconds.
+        A value of zero disables the timeout mechanism.  This parameter can only be set in
+        the <filename>postgresql.conf</filename> file or on the server command line.
+       </para>
+       <para>
+        The default value is 60 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     </variablelist>
+    </sect2>
+
    </sect1>
 
    <sect1 id="runtime-config-query">
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0ed3a47..fc6f7ec 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1315,6 +1315,61 @@ ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
     </para>
    </sect2>
 
+   <sect2 id="fdw-callbacks-transaction-managements">
+    <title>FDW Routines for Transaction Management</title>
+
+    <para>
+     If an FDW wishes to support <firstterm>atomic distributed transaction commit</firstterm>
+     (as described in <xref linkend="fdw-transactions"/>), it must provide
+     the following callback functions:
+    </para>
+
+    <para>
+<programlisting>
+char *
+GetPreparedId(Oid serverid, Oid userid, int *prep_info_len);
+</programlisting>
+    Generate a prepared transaction identifier. Note that the transaction
+    identifier must be a string literal, less than 200 bytes long, and
+    should not be the same as any other concurrent prepared transaction
+    identifier.
+    </para>
+    <para>
+<programlisting>
+bool
+PrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+                          const char *prep_id);
+</programlisting>
+    Prepare the foreign transaction on the foreign server. This function is called
+    before the local transaction commits, at the pre-commit phase of the local
+    transaction. Returning true means that preparing the foreign transaction
+    succeeded.
+    </para>
+    <para>
+<programlisting>
+bool
+EndForeignTransaction(Oid serverid, Oid userid, Oid umid,
+                      bool is_commit);
+</programlisting>
+    Commit or roll back the foreign transaction on the foreign server. For foreign
+    servers that support the two-phase commit protocol, this function is called
+    both at the pre-commit phase of the local transaction when committing
+    and at the end of the local transaction when aborting. For foreign servers
+    that don't support the two-phase commit protocol, this function is called
+    at the pre-commit phase of the local transaction.
+    </para>
+    <para>
+<programlisting>
+bool
+ResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+                                  Oid umid, bool is_commit,
+                                  const char *prep_id);
+</programlisting>
+    Commit or roll back the prepared foreign transaction on the foreign server.
+    This function is called both by the foreign transaction resolver process, after
+    the foreign transaction is prepared, and by <function>pg_resolve_fdw_xacts</function>.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
@@ -1760,4 +1815,91 @@ GetForeignServerByName(const char *name, bool missing_ok);
 
   </sect1>
 
+  <sect1 id="fdw-transactions">
+    <title>Transaction manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</productname> transaction manager allows FDWs to read
+    and write data on foreign servers within a transaction while maintaining atomicity
+    (and hence consistency) of the foreign data. When a Foreign Data Wrapper starts a
+    transaction on a foreign server as part of a <productname>PostgreSQL</productname>
+    transaction, it is required to register the foreign server, along with the
+    <productname>PostgreSQL</productname> user whose user mapping is used to connect
+    to it, using <function>FdwXactRegisterForeignServer</function>.
+<programlisting>
+void
+FdwXactRegisterForeignServer(Oid serverid,
+                             Oid userid,
+                             bool can_prepare,
+                             bool modify)
+</programlisting>
+    <varname>can_prepare</varname> should be true if the foreign server supports
+    two-phase commit protocol, false otherwise. <varname>modify</varname> should be
+    true if you're attempting to modify data on foreign server in current transaction.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    Here ft1 and ft2 are foreign tables on different foreign servers, which may be
+    using different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</varname> is greater than zero,
+    <productname>PostgreSQL</productname> employs the two-phase commit protocol to
+    achieve atomic distributed transaction commit. All the foreign servers registered
+    should support the two-phase commit protocol. The two-phase commit protocol is
+    used for achieving atomic distributed transaction commit when two or more foreign
+    servers that support the two-phase commit protocol are involved in the transaction,
+    or when the transaction involves one foreign server that supports the two-phase
+    commit protocol together with changes to local data. In other cases, for example
+    when only one foreign server that supports the two-phase commit protocol is involved
+    and no local data is changed, the two-phase commit protocol is not used. In the
+    two-phase commit protocol the commit is processed in two phases: the prepare phase
+    and the commit phase. In the prepare phase, <productname>PostgreSQL</productname>
+    prepares the transactions on all the registered foreign servers using
+    <function>FdwXactPrepareForeignTransactions</function>. If any of the foreign
+    servers fails to prepare its transaction, the prepare phase fails. In the commit phase,
+    all the prepared transactions are committed by the foreign transaction resolver
+    process if the prepare phase has succeeded, or rolled back if the prepare phase failed
+    to prepare the transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase the distributed transaction manager calls
+    <function>GetPrepareId</function> to get the prepared transaction
+    identifier for each foreign server involved. It stores this identifier along
+    with the serverid, userid and user mapping id for later use. It then calls
+    <function>PrepareForeignTransaction</function> with the same identifier.
+    </para>
+
+    <para>
+    During the commit phase the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</function> with the same identifier,
+    with action FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be tried again
+    through the built-in <function>pg_resolve_fdw_xacts</function>, or by the foreign
+    transaction resolver process if it is running.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</varname> is zero, atomic
+    commit cannot be guaranteed across foreign servers. If the transaction on
+    <productname>PostgreSQL</productname> is committed, the distributed transaction
+    manager commits the transaction on each of the foreign servers registered using
+    <function>FdwXactRegisterForeignServer</function>, independent of the outcome
+    of the same operation on the other foreign servers. Thus transactions on some
+    foreign servers may be committed, while those on other foreign servers
+    are rolled back. If the transaction on <productname>PostgreSQL</productname>
+    aborts, the transactions on all the foreign servers are aborted too.
+    </para>
+  </sect1>
  </chapter>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 4dd9d02..bc1f71b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20342,6 +20342,53 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-fdw-transaction">
+   <title>Foreign Transaction Management Functions</title>
+
+   <indexterm>
+    <primary>pg_resolve_fdw_xacts</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_remove_fdw_xacts</primary>
+   </indexterm>
+
+   <para>
+    <xref linkend="functions-fdw-transaction-table"/> shows the functions
+    available for foreign transaction management.
+    These functions cannot be executed during recovery. Use of these functions
+    is restricted to superusers.
+   </para>
+
+   <table id="functions-fdw-transaction-table">
+    <title>Foreign Transaction Management Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_resolve_fdw_xacts(<parameter>void</parameter>)</function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>Resolve all foreign transactions in the currently connected database</entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_remove_fdw_xacts(<parameter>xid</parameter> <type>xid</type>, <parameter>dbid</parameter> <type>oid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>void</type></entry>
+       <entry>
+        Remove prepared foreign transaction entries. This function searches for prepared
+        foreign transactions matching the criteria and removes them. It does not remove
+        an entry that is locked by some other backend.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+  </sect2>
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 8a97936..2fd44f7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -332,6 +332,14 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_fdw_xact_resolver</structname><indexterm><primary>pg_stat_fdw_xact_resolver</primary></indexterm></entry>
+      <entry>One row per foreign transaction resolver process, showing statistics about
+       foreign transaction resolution. See <xref linkend="pg-stat-foreign-xact-resolver-view"/> for
+       details.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -2182,6 +2190,42 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connection.
   </para>
 
+  <table id="pg-stat-foreign-xact-resolver-view" xreflabel="pg_stat_fdw_xact_resolver">
+   <title><structname>pg_stat_fdw_xact_resolver</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>pid</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Process ID of a foreign transaction resolver process</entry>
+    </row>
+    <row>
+     <entry><structfield>dbid</structfield></entry>
+     <entry><type>oid</type></entry>
+     <entry>OID of the database that the foreign transaction resolver process connects to</entry>
+    </row>
+    <row>
+     <entry><structfield>last_resolution_time</structfield></entry>
+     <entry><type>timestamp with time zone</type></entry>
+     <entry>Time of the last resolution of a foreign transaction</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_fdw_xact_resolver</structname> view will contain one
+   row per foreign transaction resolver process, showing the state of resolution
+   of foreign transactions.
+  </para>
 
   <table id="pg-stat-archiver-view" xreflabel="pg_stat_archiver">
    <title><structname>pg_stat_archiver</structname> View</title>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..742e825 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,9 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdwxactdesc.o \
+	genericdesc.o  gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdwxactdesc.c b/src/backend/access/rmgrdesc/fdwxactdesc.c
new file mode 100644
index 0000000..b262645
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdwxactdesc.c
@@ -0,0 +1,68 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign server.
+ *
+ * This module describes the WAL records for foreign transaction manager.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/rmgrdesc/fdwxactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdwxact.h"
+#include "access/xloginsert.h"
+#include "lib/stringinfo.h"
+
+void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FdwXactOnDiskData *fdw_insert_xlog = (FdwXactOnDiskData *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+
+		/*
+		 * TODO: we also need to assess whether we want to add this
+		 * information
+		 */
+		appendStringInfo(buf, " foreign transaction info: %s",
+						 fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f72f076..d5ce90d 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,16 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s "
+						 "max_prepared_foreign_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..90d0056 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,9 +12,9 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
-	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
-	xact.o xlog.o xlogarchive.o xlogfuncs.o \
+OBJS = clog.o commit_ts.o fdwxact.o fdwxact_resolver.o generic_xlog.o multixact.o \
+	parallel.o rmgr.o slru.o subtrans.o timeline.o transam.o twophase.o \
+	twophase_rmgr.o varsup.o xact.o xlog.o xlogarchive.o xlogfuncs.o \
 	xloginsert.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/fdwxact.c b/src/backend/access/transam/fdwxact.c
new file mode 100755
index 0000000..91791b0
--- /dev/null
+++ b/src/backend/access/transam/fdwxact.c
@@ -0,0 +1,2448 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact.c
+ *		PostgreSQL distributed transaction manager for foreign servers.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdwxact.c
+ *
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using the function FdwXactRegisterForeignServer(). A foreign
+ * server connection is identified by the OIDs of the foreign server and user.
+ *
+ * The commit is executed in two phases. In the first phase, executed during
+ * the pre-commit phase, transactions are prepared on all the foreign servers
+ * that can participate in the two-phase commit protocol. Transactions on other
+ * foreign servers are committed in the same phase. In the second phase, if the
+ * first phase does not succeed for whatever reason, the foreign servers
+ * are asked to roll back their prepared transactions, or to abort the
+ * transactions if they are not prepared. This is done by the backend
+ * process that executed the first phase. If the first phase succeeds, the
+ * backend process adds itself to the queue in shared memory and then
+ * asks the foreign transaction resolver process to resolve the foreign
+ * transactions associated with its transaction. Once the foreign transaction
+ * resolver process has resolved all of them, the backend wakes up
+ * and resumes processing.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved (aka a dangling transaction).
+ * During the first phase, before actually preparing the transactions, enough
+ * information is persisted to disk and logged in order to resolve such transactions.
+ *
+ * During WAL replay and replication, FdwXactCtl also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdwxact records happens by the following rules:
+ *
+ * 	* On PREPARE redo we add the foreign transaction to FdwXactCtl->fdw_xacts.
+ *	  We set fdw_xact->inredo to true for such entries.
+ *	* On Checkpoint redo we iterate through FdwXactCtl->fdw_xacts entries
+ *	  that have fdw_xact->inredo set and are behind the redo_horizon.
+ *	  We save them to disk and also set fdw_xact->ondisk to true.
+ *	* On COMMIT and ABORT we delete the entry from FdwXactCtl->fdw_xacts.
+ *	  If fdw_xact->ondisk is true, we delete the corresponding entry from
+ *	  the disk as well.
+ *	* RecoverPreparedTransactions() and StandbyRecoverPreparedTransactions()
+ *	  have been modified to go through fdw_xact->inredo entries that have
+ *	  not made it to disk yet.
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/htup_details.h"
+#include "access/twophase.h"
+#include "access/resolver_private.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/ps_status.h"
+#include "utils/snapmgr.h"
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid			serverid;
+	Oid			userid;
+	Oid			umid;
+	char	   *servername;
+	FdwXact		fdw_xact;		/* foreign prepared transaction entry in case
+								 * prepared */
+	bool		two_phase_commit;		/* Should use two phase commit
+										 * protocol while committing
+										 * transaction on this server,
+										 * whenever necessary. */
+	bool		modified;		/* modified on foreign server in the transaction */
+	GetPrepareId_function get_prepare_id;
+	EndForeignTransaction_function end_foreign_xact;
+	PrepareForeignTransaction_function prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
+}	FdwConnection;
+
+/* List of foreign connections participating in the transaction */
+List	   *MyFdwConnections = NIL;
+
+/* Shmem hash entry */
+typedef struct
+{
+	/* tag */
+	TransactionId	xid;
+
+	/* data */
+	FdwXact	first_entry;
+} FdwXactHashEntry;
+
+static HTAB	*FdwXactHash;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool		TwoPhaseReady = true;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 bytes xid, 8 bytes foreign
+ * server oid and 8 bytes user oid separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FdwXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+			 serverid, userid)
+
+static FdwXact FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_info);
+static void FdwXactPrepareForeignTransactions(void);
+static void AtProcExit_FdwXact(int code, Datum arg);
+static bool FdwXactResolveForeignTransaction(FdwXact fdw_xact,
+											 ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static void UnlockMyFdwXacts(void);
+static void remove_fdw_xact(FdwXact fdw_xact);
+static FdwXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, char *fdw_xact_id);
+static int	GetFdwXactList(FdwXact * fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FdwXact fdw_xact);
+static FdwXactOnDiskData *ReadFdwXactFile(TransactionId xid, Oid serverid,
+				Oid userid);
+static void RemoveFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+				  bool giveWarning);
+static void RecreateFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len);
+static void XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len);
+static FdwXact get_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
+static bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+							List **qualifying_xacts);
+
+static void FdwXactQueueInsert(void);
+static void FdwXactCancelWait(void);
+
+/* Guc parameters */
+int			max_prepared_foreign_xacts = 0;
+int			max_foreign_xact_resolvers = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Foreign transaction entries locked by this backend */
+List	   *MyLockedFdwXacts = NIL;
+FdwXactResolver *MyFdwXactResolver = NULL;
+
+/* Record the server, userid participating in the transaction. */
+void
+FdwXactRegisterForeignServer(Oid serverid, Oid userid, bool two_phase_commit,
+							 bool modify)
+{
+	FdwConnection *fdw_conn;
+	ListCell   *lcell;
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	UserMapping *user_mapping;
+	FdwRoutine *fdw_routine;
+	MemoryContext old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Quick return if the entry already exists */
+	foreach(lcell, MyFdwConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		/* Quick return if there is already registered connection */
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+		{
+			fdw_conn->modified |= modify;
+			return;
+		}
+	}
+
+	/*
+	 * This list and its contents need to be saved in the transaction context
+	 * memory
+	 */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FdwConnection *) palloc(sizeof(FdwConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		ereport(ERROR,
+				(errmsg("no function to end a foreign transaction provided for FDW %s",
+						fdw->fdwname)));
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (max_foreign_xact_resolvers == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_foreign_xact_resolvers to a nonzero value.")));
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for preparing foreign transaction for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for resolving prepared foreign transaction for FDW %s",
+							fdw->fdwname)));
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->modified = modify;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFdwConnections = lappend(MyFdwConnections, fdw_conn);
+	/* Revert back the context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/*
+ * FdwXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+Size
+FdwXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FdwXactCtlData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FdwXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FdwXactData)));
+
+	size = MAXALIGN(size);
+	size = add_size(size, hash_estimate_size(max_prepared_foreign_xacts,
+											 sizeof(FdwXactHashEntry)));
+
+	return size;
+}
+
+/*
+ * FdwXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FdwXactCtlData structure.
+ */
+void
+FdwXactShmemInit(void)
+{
+	bool		found;
+
+	FdwXactCtl = ShmemInitStruct("Foreign transactions table",
+								 FdwXactShmemSize(),
+								 &found);
+	if (!IsUnderPostmaster)
+	{
+		FdwXact		fdw_xacts;
+		HASHCTL		info;
+		long		init_hash_size;
+		long		max_hash_size;
+		int			cnt;
+
+		Assert(!found);
+		FdwXactCtl->freeFdwXacts = NULL;
+		FdwXactCtl->numFdwXacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FdwXact)
+			((char *) FdwXactCtl +
+			 MAXALIGN(offsetof(FdwXactCtlData, fdw_xacts) +
+					  sizeof(FdwXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_free_next = FdwXactCtl->freeFdwXacts;
+			FdwXactCtl->freeFdwXacts = &fdw_xacts[cnt];
+		}
+
+		MemSet(&info, 0, sizeof(info));
+		info.keysize = sizeof(TransactionId);
+		info.entrysize = sizeof(FdwXactHashEntry);
+
+		max_hash_size = max_prepared_foreign_xacts;
+		init_hash_size = max_hash_size / 2;
+
+		FdwXactHash = ShmemInitHash("FdwXact hash",
+									init_hash_size,
+									max_hash_size,
+									&info,
+									HASH_ELEM | HASH_BLOBS);
+	}
+	else
+	{
+		Assert(FdwXactCtl);
+		Assert(found);
+	}
+}
+
+
+/*
+ * PreCommit_FdwXacts
+ *
+ * This function is responsible for pre-commit processing on foreign connections.
+ * Basically, the foreign transactions are prepared on the foreign servers that
+ * can execute the two-phase commit protocol. However, when only one server
+ * that can execute the two-phase commit protocol is involved in the transaction
+ * and no changes are made to local data, we don't need the two-phase commit
+ * protocol, so we try to commit the transaction on that server. Prepared
+ * transactions will be aborted or committed after the current transaction has
+ * been aborted or committed, respectively. We try to commit transactions on the
+ * rest of the foreign servers now. For these foreign servers it is possible
+ * that some transactions commit even if the local transaction aborts.
+ */
+void
+PreCommit_FdwXacts(void)
+{
+	ListCell   *cur;
+	ListCell   *prev;
+	ListCell   *next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFdwConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not
+	 * execute two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFdwConnections), prev = NULL; cur; cur = next)
+	{
+		FdwConnection *fdw_conn = lfirst(cur);
+
+		next = lnext(cur);
+
+		/*
+		 * In the pre-commit phase we commit the foreign transactions on
+		 * servers that either cannot execute the two-phase commit protocol or
+		 * were not modified by this transaction.
+		 */
+		if (!fdw_conn->two_phase_commit || !fdw_conn->modified)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns failure status, there's hardly anything to do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+					 fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFdwConnections = list_delete_cell(MyFdwConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/*
+	 * Here, we have committed foreign transactions on foreign servers that
+	 * cannot execute the two-phase commit protocol, and MyFdwConnections has only
+	 * foreign servers that can execute the two-phase commit protocol. We prepare
+	 * foreign transactions if the two-phase commit protocol is required.
+	 */
+	if (TwoPhaseCommitRequired())
+	{
+		/*
+		 * Prepare the transactions on the all foreign servers, which can
+		 * execute two-phase-commit protocol.
+		 */
+		FdwXactPrepareForeignTransactions();
+	}
+	else if (list_length(MyFdwConnections) == 1)
+	{
+		FdwConnection *fdw_conn = lfirst(list_head(MyFdwConnections));
+
+		/*
+		 * We don't need to use the two-phase commit protocol when only one
+		 * server remains, even if this server can execute the two-phase commit
+		 * protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
+		/* The only remaining connection is done, so clear MyFdwConnections */
+		MyFdwConnections = list_delete_first(MyFdwConnections);
+	}
+}
+
+/*
+ * FdwXactPrepareForeignTransactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+FdwXactPrepareForeignTransactions(void)
+{
+	ListCell   *lcell;
+	FdwXact		prev_fdwxact = NULL;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFdwConnections)
+	{
+		FdwConnection *fdw_conn = (FdwConnection *) lfirst(lcell);
+		char	    *fdw_xact_id;
+		int			fdw_xact_id_len;
+		FdwXact		fdw_xact;
+
+		if (!fdw_conn->two_phase_commit || !fdw_conn->modified)
+			continue;
+
+
+		/* Generate prepare transaction id for foreign server */
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+											   fdw_conn->userid,
+											   &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to
+		 * prepare it on the foreign server. Registration persists this
+		 * information to disk and to WAL (which also relays it to the
+		 * standby). Thus, if we lose connectivity to the foreign server or
+		 * crash ourselves, we will remember that we have prepared a
+		 * transaction on the foreign server and try to resolve it when
+		 * connectivity is restored or after crash recovery.
+		 *
+		 * If we crash after persisting the information but before preparing
+		 * the transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long
+		 * as the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before
+		 * persisting the information to the disk and crash in-between these
+		 * two steps, we will forget that we prepared the transaction on the
+		 * foreign server and will not be able to resolve it after the crash.
+		 * Hence persist first then prepare.
+		 */
+		fdw_xact = FdwXactRegisterFdwXact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id);
+
+		/*
+		 * Between the FdwXactRegisterFdwXact call above and the moment this
+		 * backend hears back from the foreign server, the backend may abort
+		 * the local transaction (say, because of a signal). During abort
+		 * processing, it will send an ABORT message to the foreign server. If
+		 * the foreign server has not prepared the transaction, the message
+		 * will succeed. If the foreign server has prepared the transaction, it
+		 * will throw an error, which we will ignore, and the prepared foreign
+		 * transaction will be resolved by the foreign transaction resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id))
+		{
+			StringInfo servername;
+			/*
+			 * An error occurred, and we didn't prepare the transaction.
+			 * Delete the entry from foreign transaction table. Raise an
+			 * error, so that the local server knows that one of the foreign
+			 * servers has failed to prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then
+			 * we raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell pointer,
+			 * but that is fine since we will not use that pointer because the
+			 * subsequent ereport will get us out of this loop.
+			 */
+			servername = makeStringInfo();
+			appendStringInfoString(servername, fdw_conn->servername);
+			MyFdwConnections = list_delete_ptr(MyFdwConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("could not prepare transaction on foreign server %s",
+							servername->data)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+
+		/*
+		 * If this is the first fdwxact entry, keep it in the hash table for
+		 * later use.
+		 */
+		if (!prev_fdwxact)
+		{
+			FdwXactHashEntry	*fdwxact_entry;
+			bool				found;
+			TransactionId		key;
+
+			key = fdw_xact->local_xid;
+
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			fdwxact_entry = (FdwXactHashEntry *) hash_search(FdwXactHash,
+															 &key,
+															 HASH_ENTER, &found);
+			Assert(!found);
+
+			/* Fill in the entry before anyone else can look it up */
+			fdwxact_entry->first_entry = fdw_xact;
+			LWLockRelease(FdwXactLock);
+		}
+		else
+		{
+			/*
+			 * Make a list of fdwxacts that are associated with the
+			 * same local transaction.
+			 */
+			Assert(fdw_xact->fx_next == NULL);
+			prev_fdwxact->fx_next = fdw_xact;
+		}
+
+		prev_fdwxact = fdw_xact;
+	}
+
+	return;
+}
+
+/*
+ * FdwXactRegisterFdwXact
+ *
+ * This function is used to create a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; the entry is persisted to disk under the pg_fdw_xact directory at the
+ * next checkpoint.
+ */
+static FdwXact
+FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+					   Oid umid, char *fdw_xact_id)
+{
+	FdwXact		fdw_xact;
+	FdwXactOnDiskData *fdw_xact_file_data;
+	MemoryContext	old_context;
+	int			data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid, fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->locking_backend = MyBackendId;
+
+	LWLockRelease(FdwXactLock);
+
+	/* Remember that we have locked this entry. */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	MyLockedFdwXacts = lappend(MyLockedFdwXacts, fdw_xact);
+	MemoryContextSwitchTo(old_context);
+
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FdwXactOnDiskData, fdw_xact_id);
+	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FdwXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+		   FDW_XACT_ID_LEN);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *) fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* The WAL record is flushed; checkpoint can now persist this entry */
+	fdw_xact->valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. Caller must hold
+ * FdwXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FdwXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				char *fdw_xact_id)
+{
+	int i;
+	FdwXact fdw_xact;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FdwXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	/* Check for a duplicate foreign transaction entry */
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		fdw_xact = FdwXactCtl->fdw_xacts[i];
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+				 xid, serverid, userid);
+	}
+
+	/*
+	 * Get the next free foreign transaction entry. Raise an error if there
+	 * are none left.
+	 */
+	if (!FdwXactCtl->freeFdwXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FdwXactCtl->freeFdwXacts;
+	FdwXactCtl->freeFdwXacts = fdw_xact->fx_free_next;
+
+	/* Insert the entry to active array */
+	Assert(FdwXactCtl->numFdwXacts < max_prepared_foreign_xacts);
+	FdwXactCtl->fdw_xacts[FdwXactCtl->numFdwXacts++] = fdw_xact;
+
+	/*
+	 * Initialize the entry as unlocked; the caller stamps it with its backend
+	 * id before releasing the LWLock.
+	 */
+	fdw_xact->locking_backend = InvalidBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->inredo = false;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and disk,
+ * and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FdwXact fdw_xact)
+{
+	int			cnt;
+
+	Assert(fdw_xact != NULL);
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		if (FdwXactCtl->fdw_xacts[cnt] == fdw_xact)
+		{
+			/* Remove the entry from active array */
+			FdwXactCtl->numFdwXacts--;
+			FdwXactCtl->fdw_xacts[cnt] = FdwXactCtl->fdw_xacts[FdwXactCtl->numFdwXacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_free_next = FdwXactCtl->freeFdwXacts;
+			FdwXactCtl->freeFdwXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			fdw_xact->fx_next = NULL;
+			MyLockedFdwXacts = list_delete_ptr(MyLockedFdwXacts, fdw_xact);
+
+			LWLockRelease(FdwXactLock);
+
+			if (!RecoveryInProgress())
+			{
+				FdwRemoveXlogRec fdw_remove_xlog;
+				XLogRecPtr	recptr;
+
+				/* Fill up the log record before releasing the entry */
+				fdw_remove_xlog.serverid = fdw_xact->serverid;
+				fdw_remove_xlog.dbid = fdw_xact->dboid;
+				fdw_remove_xlog.xid = fdw_xact->local_xid;
+				fdw_remove_xlog.userid = fdw_xact->userid;
+
+				START_CRIT_SECTION();
+
+				/*
+				 * Log that we are removing the foreign transaction entry and
+				 * remove the file from the disk as well.
+				 */
+				XLogBeginInsert();
+				XLogRegisterData((char *) &fdw_remove_xlog, sizeof(fdw_remove_xlog));
+				recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+				XLogFlush(recptr);
+
+				END_CRIT_SECTION();
+			}
+
+			/* Remove the file from the disk if it exists. */
+			if (fdw_xact->ondisk)
+				RemoveFdwXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								  fdw_xact->userid, true);
+			return;
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find foreign transaction entry for xid %u, foreign server %u, and user %u",
+		 fdw_xact->local_xid, fdw_xact->serverid, fdw_xact->userid);
+}
+
+/*
+ * We don't need to use the two-phase commit protocol if there is only one
+ * foreign server that can execute two-phase commit and the transaction did
+ * not write on the local node.
+ */
+bool
+TwoPhaseCommitRequired(void)
+{
+	if ((list_length(MyFdwConnections) > 1) ||
+		(list_length(MyFdwConnections) == 1 && TransactionDidWrite))
+		return true;
+
+	return false;
+}
+
+/*
+ * UnlockMyFdwXacts
+ *
+ * Unlock the foreign transaction entries locked by this backend and remove
+ * them from the backend's list of foreign transactions.
+ */
+static void
+UnlockMyFdwXacts(void)
+{
+	ListCell *cell;
+	ListCell *next;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	for (cell = list_head(MyLockedFdwXacts); cell != NULL; cell = next)
+	{
+		FdwXact	fdwxact = (FdwXact) lfirst(cell);
+
+		next = lnext(cell);
+
+		/*
+		 * The resolver process may remove a fdwxact entry once it has been
+		 * resolved, which frees the entry for reuse. So an entry in
+		 * MyLockedFdwXacts may by now be locked by another backend that
+		 * acquired it before we got here. Hence unlock only the entries
+		 * that are still locked by MyBackendId.
+		 */
+		if (fdwxact->locking_backend == MyBackendId)
+		{
+			/*
+			 * We set the locking backend as invalid, and then remove it from
+			 * the list of locked foreign transactions, under the LWLock. If we
+			 * reversed the order and the process exited in between, we would
+			 * leave an entry locked by this backend, which would get unlocked
+			 * only at the server restart.
+			 */
+
+			fdwxact->locking_backend = InvalidBackendId;
+			MyLockedFdwXacts = list_delete_ptr(MyLockedFdwXacts, fdwxact);
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * AtProcExit_FdwXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FdwXact(int code, Datum arg)
+{
+	UnlockMyFdwXacts();
+}
+
+/*
+ * Wait for foreign transaction resolution, if requested by user.
+ *
+ * Initially backends start in state FDW_XACT_NOT_WAITING and then
+ * change that state to FDW_XACT_WAITING before adding themselves
+ * to the wait queue. During FdwXactResolveForeignTransactions a fdwxact
+ * resolver changes the state to FDW_XACT_WAIT_COMPLETE once foreign
+ * transactions are resolved. This backend then resets its state
+ * to FDW_XACT_NOT_WAITING. wait_xid is the local transaction whose foreign
+ * transactions we wait for; is_commit tells whether they are to be
+ * committed or aborted.
+ *
+ * This function is inspired by SyncRepWaitForLSN.
+ */
+void
+FdwXactWaitForResolution(TransactionId wait_xid, bool is_commit)
+{
+	char		*new_status = NULL;
+	const char	*old_status;
+	ListCell	*cell;
+	List		*entries_to_resolve;
+
+	/*
+	 * Quick exit if user has not requested foreign transaction resolution
+	 * or there are no foreign servers that are modified in the current
+	 * transaction.
+	 */
+	if (!FdwXactEnabled())
+		return;
+
+	Assert(FdwXactCtl != NULL);
+	Assert(TransactionIdIsValid(wait_xid));
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	Assert(MyProc->fdwXactState == FDW_XACT_NOT_WAITING);
+
+	/*
+	 * Get the list of foreign transactions that are involved with the
+	 * given wait_xid.
+	 */
+	search_fdw_xact(wait_xid, MyDatabaseId, InvalidOid, InvalidOid,
+					&entries_to_resolve);
+
+	/* Quick exit if we found no foreign transaction that we need to resolve */
+	if (list_length(entries_to_resolve) <= 0)
+		return;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Change status of fdw_xact entries according to is_commit */
+	foreach (cell, entries_to_resolve)
+	{
+		FdwXact fdw_xact = (FdwXact) lfirst(cell);
+
+		/* Don't overwrite status if fate is determined */
+		if (fdw_xact->status == FDW_XACT_PREPARING)
+			fdw_xact->status = (is_commit ?
+								FDW_XACT_COMMITTING_PREPARED :
+								FDW_XACT_ABORTING_PREPARED);
+	}
+
+	/* Set backend status and enqueue itself */
+	MyProc->fdwXactState = FDW_XACT_WAITING;
+	MyProc->waitXid = wait_xid;
+	FdwXactQueueInsert();
+	LWLockRelease(FdwXactLock);
+
+	/* Launch a resolver process if one is not running yet, then wake it up */
+	fdwxact_maybe_launch_resolver();
+
+	/*
+	 * Alter ps display to show waiting for foreign transaction
+	 * resolution.
+	 */
+	if (update_process_title)
+	{
+		int len;
+
+		old_status = get_ps_display(&len);
+		new_status = (char *) palloc(len + 31 + 1);
+		memcpy(new_status, old_status, len);
+		sprintf(new_status + len, " waiting for resolve %u", wait_xid);
+		set_ps_display(new_status, false);
+		new_status[len] = '\0';	/* truncate off "waiting ..." */
+	}
+
+	/* Wait for all foreign transactions to be resolved */
+	for (;;)
+	{
+		/* Must reset the latch before testing state */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Acquiring the lock is not needed, the latch ensures proper
+		 * barriers. If it looks like we're done, we must really be done,
+		 * because once the resolver changes the state to
+		 * FDW_XACT_WAIT_COMPLETE, it will never update it again, so we can't
+		 * be seeing a stale value in that case.
+		 */
+		if (MyProc->fdwXactState == FDW_XACT_WAIT_COMPLETE)
+			break;
+
+		/*
+		 * If a wait for foreign transaction resolution is pending, we can
+		 * neither acknowledge the commit nor raise ERROR or FATAL.  The latter
+		 * would lead the client to believe that the distributed transaction
+		 * aborted, which is not true: it's already committed locally. The
+		 * former is no good either: the client has requested committing a
+		 * distributed transaction, and is entitled to assume that an
+		 * acknowledged commit is also committed on all foreign servers, which
+		 * might not be true. So in this case we issue a WARNING (which some
+		 * clients may be able to interpret) and shut off further output. We do
+		 * NOT reset ProcDiePending, so that the process will die after the
+		 * commit is cleaned up.
+		 */
+		if (ProcDiePending)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("canceling the wait for resolving foreign transaction and terminating connection due to administrator command"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If a query cancel interrupt arrives we just terminate the wait with
+		 * a suitable warning. The foreign transactions may be left orphaned,
+		 * but the foreign transaction resolver can pick them up and try to
+		 * resolve them later.
+		 */
+		if (QueryCancelPending)
+		{
+			QueryCancelPending = false;
+			ereport(WARNING,
+					(errmsg("canceling wait for resolving foreign transaction due to user request"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If the postmaster dies, we'll probably never get an
+		 * acknowledgement, because the resolver processes will exit too. So
+		 * just bail out.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * Wait on latch.  Any condition that should wake us up will set the
+		 * latch, so no need for timeout.
+		 */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+				  WAIT_EVENT_FDW_XACT_RESOLUTION);
+	}
+
+	pg_read_barrier();
+
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+
+	/*
+	 * Unlock the list of locked entries. Entries that could not be resolved
+	 * remain as dangling transactions.
+	 */
+	UnlockMyFdwXacts();
+	MyLockedFdwXacts = NIL;
+
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+}
+
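+/*
+ * Cancel the wait: remove MyProc from the FdwXactQueue if it is still linked
+ * and reset the wait state to FDW_XACT_NOT_WAITING.
+ */
+static void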
+FdwXactCancelWait(void)
+{
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+		SHMQueueDelete(&(MyProc->fdwXactLinks));
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * Insert MyProc into the FdwXactQueue.
+ */
+static void
+FdwXactQueueInsert(void)
+{
+	SHMQueueInsertBefore(&(FdwXactRslvCtl->FdwXactQueue),
+						 &(MyProc->fdwXactLinks));
+}
+
+/*
+ * Resolve all foreign transactions associated with the same local
+ * transaction in the given database. After resolving all foreign
+ * transactions we release the waiter. Return the number of foreign
+ * transactions we resolved.
+ */
+int
+FdwXactResolveForeignTransactions(Oid dbid)
+{
+	TransactionId		key;
+	volatile FdwXact	fdwxact = NULL;
+	volatile FdwXact	fx_next;
+	FdwXactHashEntry	*fdwxact_entry;
+	int					n_resolved = 0;
+	bool				found;
+	PGPROC				*proc;
+
+	Assert(OidIsValid(dbid));
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
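+	/* Fetch the first proc in the queue that belongs to the given database */
+	proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->FdwXactQueue),
+								   &(FdwXactRslvCtl->FdwXactQueue),
+								   offsetof(PGPROC, fdwXactLinks));
+	while (proc)
+	{
+		/* Found a target proc we need to resolve */
+		if (proc->databaseId == dbid)
+			break;
+
+		/* Advance from the current element, not from the queue head */
+		proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->FdwXactQueue),
+									   &(proc->fdwXactLinks),
+									   offsetof(PGPROC, fdwXactLinks));
+	}
+
+	/* Return if no waiter in the queue belongs to this database */
+	if (!proc)
+	{
+		LWLockRelease(FdwXactLock);
+		return 0;
+	}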
+
+	Assert(TransactionIdIsValid(proc->waitXid));
+
+	/* Search fdwxact entry from the shmem hash by local transaction id */
+	key = proc->waitXid;
+	fdwxact_entry = (FdwXactHashEntry *) hash_search(FdwXactHash,
+													 (void *) &key,
+													 HASH_ENTER, &found);
+
+	/*
+	 * The fdwxact entry may be missing from the hash table after recovery.
+	 * In this case, we initialize it and construct the list of fdwxact
+	 * entries in FdwXactHashEntry from the foreign transactions associated
+	 * with the same local transaction.
+	 */
+	if (!found)
+	{
+		int i;
+		FdwXact prev_fx = NULL;
+
+		/* Initialize entry */
+		fdwxact_entry->xid = proc->waitXid;
+		fdwxact_entry->first_entry = NULL;
+
+		for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+		{
+			FdwXact fx = FdwXactCtl->fdw_xacts[i];
+
+			if (fx->dboid == dbid && fx->local_xid == proc->waitXid)
+			{
+				/* Save first entry of the list */
+				if (fdwxact_entry->first_entry == NULL)
+					fdwxact_entry->first_entry = fx;
+
+				/* Link from previous entry to this entry */
+				if (prev_fx)
+					prev_fx->fx_next = fx;
+
+				prev_fx = fx;
+			}
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+	fdwxact = fdwxact_entry->first_entry;
+
+	/* Resolve all foreign transactions associated with proc->waitXid */
+	while (fdwxact)
+	{
+		/*
+		 * Remember the next entry to resolve since the current entry will be
+		 * removed once it is resolved.
+		 */
+		fx_next = fdwxact->fx_next;
+
+		/*
+		 * Resolve foreign transaction. We keep trying to resolve the foreign
+		 * transaction until we succeed.
+		 *
+		 * XXX : We might want to try to resolve other foreign transactions
+		 * as much as possible.
+		 */
+		while (!FdwXactResolveForeignTransaction(fdwxact,
+												 get_prepared_foreign_xact_resolver(fdwxact)))
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			/*
+			 * If we failed to resolve the foreign transaction, retry after a
+			 * short sleep.
+			 */
+			pg_usleep(1 * 1000L * 1000L);	/* 1 sec */
+		}
+
+		n_resolved++;
+		fdwxact = fx_next;
+	}
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Remove the entry for this xid from the shmem hash table as well */
+	hash_search(FdwXactHash, (void *) &key, HASH_REMOVE, NULL);
+
+	/*
+	 * Remove proc from queue, if not detached yet. If the waiter canceled
+	 * its wait before resolution, this proc is already detached.
+	 */
+	if (!SHMQueueIsDetached(&(proc->fdwXactLinks)))
+	{
+		SHMQueueDelete(&(proc->fdwXactLinks));
+
+		pg_write_barrier();
+
+		/* Set state to complete */
+		proc->fdwXactState = FDW_XACT_WAIT_COMPLETE;
+
+		/* Wake up the waiter only when we have set state and removed from queue */
+		SetLatch(&(proc->procLatch));
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	return n_resolved;
+}
+
+/*
+ * Resolve all dangling foreign transactions as much as possible in the given
+ * database. Return the number of foreign transactions we resolved.
+ */
+int
+FdwXactResolveDanglingTransactions(Oid dbid)
+{
+	List		*fxact_list = NIL;
+	ListCell	*cell;
+	int			n_resolved = 0;
+	int			i;
+
+	Assert(OidIsValid(dbid));
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	/*
+	 * Make a list of dangling transactions whose corresponding local
+	 * transaction is on the same database.
+	 */
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		FdwXact fxact = FdwXactCtl->fdw_xacts[i];
+
+		/*
+		 * Append the fdwxact entry to the list if it is on the same
+		 * database, is not locked by anyone, and its local transaction
+		 * is not prepared.
+		 */
+		if (fxact->dboid == dbid &&
+			fxact->locking_backend == InvalidBackendId &&
+			!TwoPhaseExists(fxact->local_xid))
+			fxact_list = lappend(fxact_list, fxact);
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	/* There is no foreign transaction we need to resolve */
+	if (list_length(fxact_list) == 0)
+		return 0;
+
+	foreach(cell, fxact_list)
+	{
+		FdwXact fdwxact = (FdwXact) lfirst(cell);
+
+		if (!FdwXactResolveForeignTransaction(fdwxact,
+											  get_prepared_foreign_xact_resolver(fdwxact)))
+		{
+			/*
+			 * If failed to resolve this foreign transaction we skip it in
+			 * this resolution cycle. Try to resolve again in next cycle.
+			 */
+			ereport(WARNING, (errmsg("could not resolve dangling foreign transaction for xid %u, foreign server %u and user %u",
+									 fdwxact->local_xid, fdwxact->serverid, fdwxact->userid)));
+			continue;
+		}
+
+		n_resolved++;
+	}
+
+	list_free(fxact_list);
+
+	return n_resolved;
+}
+
+/*
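+ * AtEOXact_FdwXacts
+ *
+ * Clean up per-transaction foreign transaction state at the end of a local
+ * transaction. On abort, foreign transactions that were never prepared are
+ * aborted here.
+ */
+void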
+AtEOXact_FdwXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	/*
+	 * In the commit case, we have already committed the foreign transactions
+	 * on the servers that cannot execute the two-phase commit protocol, and
+	 * prepared transactions on the servers that can, during the pre-commit
+	 * phase. The prepared transactions are resolved by the resolver process.
+	 * In the abort case, on the other hand, some foreign transactions may be
+	 * prepared (or in the middle of being prepared) while others are not, so
+	 * prepared transactions must be aborted as prepared transactions, while
+	 * foreign transactions that are not prepared yet are simply aborted.
+	 */
+	if (!is_commit)
+	{
+		foreach (lcell, MyFdwConnections)
+		{
+			FdwConnection	*fdw_conn = lfirst(lcell);
+
+			/*
+			 * Since the prepared foreign transaction should have been
+			 * resolved we abort the remaining not-prepared foreign
+			 * transactions.
+			 */
+			if (!fdw_conn->fdw_xact)
+			{
+				bool ret;
+
+				ret = fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+												 fdw_conn->umid, is_commit);
+				if (!ret)
+					ereport(WARNING, (errmsg("could not abort transaction on server \"%s\"",
+											 fdw_conn->servername)));
+			}
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Other backend might lock the
+	 * entry we used to lock, but there is no reason for a foreign transaction
+	 * entry to be locked after the transaction which locked it has ended.
+	 */
+	UnlockMyFdwXacts();
+	MyLockedFdwXacts = NIL;
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFdwConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FdwXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ *
+ * Note that the transaction may abort after we have prepared foreign
+ * transactions, so we cannot release MyLockedFdwXacts and MyFdwConnections
+ * here. They are released after the resolver process has rolled the foreign
+ * transactions back during abort, or in AtEOXact_FdwXacts().
+ */
+void
+AtPrepare_FdwXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFdwConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared
+	 * should be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("cannot prepare the transaction because some foreign servers involved in the transaction cannot prepare the transaction")));
+
+	/* Prepare transactions on participating foreign servers. */
+	FdwXactPrepareForeignTransactions();
+}
+
+/*
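+ * get_prepared_foreign_xact_resolver
+ *
+ * Look up the FDW routine that resolves prepared foreign transactions for the
+ * foreign server of the given entry. Raises an error if the FDW provides none.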
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FdwXact fdw_xact)
+{
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * FdwXactResolveForeignTransaction
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine. The foreign transaction can be a dangling transaction
+ * whose fate (commit or abort) has not been decided yet.
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+FdwXactResolveForeignTransaction(FdwXact fdw_xact,
+			   ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool		resolved;
+	bool		is_commit;
+
+	if (!(fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+		 fdw_xact->status == FDW_XACT_ABORTING_PREPARED))
+		elog(DEBUG1, "fdwxact status : %d", fdw_xact->status);
+
+	/*
+	 * Determine whether we commit or abort this foreign transaction.
+	 */
+	if (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED)
+		is_commit = true;
+
+	else if (fdw_xact->status == FDW_XACT_ABORTING_PREPARED)
+		is_commit = false;
+
+	/*
+	 * If the local transaction is already committed, commit prepared
+	 * foreign transactions as well.
+	 */
+	else if (TransactionIdDidCommit(fdw_xact->local_xid))
+	{
+		fdw_xact->status = FDW_XACT_COMMITTING_PREPARED;
+		is_commit = true;
+	}
+
+	/*
+	 * If the local transaction is already aborted, abort prepared
+	 * foreign transactions as well.
+	 */
+	else if (TransactionIdDidAbort(fdw_xact->local_xid))
+	{
+		fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+		is_commit = false;
+	}
+
+	/*
+	 * The local transaction is not in progress but the foreign
+	 * transaction is not prepared on the foreign server. This
+	 * can happen when we crashed after registering this entry but
+	 * before actually preparing on the foreign server. So we assume
+	 * it to be aborted.
+	 */
+	else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+		is_commit = false;
+
+	/*
+	 * The local transaction is in progress and the foreign transaction
+	 * state is neither committing nor aborting. This should not
+	 * happen because we cannot decide whether to commit or abort a
+	 * foreign transaction associated with an in-progress local transaction.
+	 */
+	else
+		ereport(ERROR,
+				(errmsg("cannot resolve foreign transaction associated with in-progress transaction %u on server %u",
+						fdw_xact->local_xid, fdw_xact->serverid)));
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * Get foreign transaction entry from FdwXactCtl->fdw_xacts. Return NULL
+ * if foreign transaction does not exist.
+ */
+static FdwXact
+get_fdw_xact(TransactionId xid, Oid serverid, Oid userid)
+{
+	int i;
+	FdwXact fdw_xact;
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		fdw_xact = FdwXactCtl->fdw_xacts[i];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			LWLockRelease(FdwXactLock);
+			return fdw_xact;
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+	return NULL;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact.
+ * Check that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid   | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid   | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid   | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid   | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid   | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid   | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria is void (all arguments invalid), every entry matches, so
+ * the function returns true if at least one entry exists.
+ *
+ * If qualifying_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is locked by another backend is
+ * ignored. If all the qualifying entries are locked by other backends, the
+ * list will be empty but the returned value will still be true.
+ */
+static bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FdwXactLock, lock_mode);
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		FdwXact		fdw_xact = FdwXactCtl->fdw_xacts[cnt];
+		bool		entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list.
+				 * If the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+
+					fdw_xact->locking_backend = MyBackendId;
+
+					/*
+					 * The list and its members may be required at the end of
+					 * the transaction
+					 */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFdwXacts = lappend(MyLockedFdwXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		FdwXactRedoAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		/* Delete FdwXact entry and file if exists */
+		FdwXactRedoRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFdwXacts
+ *
+ * Function syncs the foreign transaction files created between two
+ * checkpoints. The foreign transaction entries, and hence the corresponding
+ * files, are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * For simplicity, the function performs the file I/O while holding
+ * FdwXactLock: for every entry that is valid (or being replayed), is not yet
+ * on disk and whose WAL record ends at or before redo_horizon, it reads the
+ * record back from WAL and writes the state file. Moving the I/O out of the
+ * lock would create a race with concurrent removal of the entry; see the
+ * comment in the function body.
+ */
+void
+CheckPointFdwXacts(XLogRecPtr redo_horizon)
+{
+	int			cnt;
+	int			serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	/* Another quick exit, before we do any work */
+	if (FdwXactCtl->numFdwXacts <= 0)
+	{
+		LWLockRelease(FdwXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FdwXact that need to be copied to
+	 * disk, so we perform all I/O while holding FdwXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * spots that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FdwXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked
+	 * invalid, because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		FdwXact		fdw_xact = FdwXactCtl->fdw_xacts[cnt];
+
+		if ((fdw_xact->valid || fdw_xact->inredo) &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFdwXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFdwXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+			  (errmsg_plural("%u foreign transaction state file was written "
+							 "for long-running prepared transactions",
+							 "%u foreign transaction state files were written "
+							 "for long-running prepared transactions",
+							 serialized_fdw_xacts,
+							 serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFdwXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+		   errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not read foreign transaction state from xlog at %X/%X",
+			   (uint32) (lsn >> 32),
+			   (uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FdwXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not recreate foreign transaction state file \"%s\": %m",
+			   path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FdwXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FdwXact		fdw_xacts;
+	int			num_xacts;
+	int			cur_xact;
+}	WorkingStatus;
+
+Datum
+pg_prepared_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFdwXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FdwXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* should this be really interpreted by FDW */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_prepared_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FdwXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFdwXactList(FdwXact * fdw_xacts)
+{
+	int			num_xacts;
+	int			cnt_xacts;
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	if (FdwXactCtl->numFdwXacts == 0)
+	{
+		LWLockRelease(FdwXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FdwXactCtl->numFdwXacts;
+	*fdw_xacts = (FdwXact) palloc(sizeof(FdwXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FdwXactCtl->fdw_xacts[cnt_xacts],
+			   sizeof(FdwXactData));
+
+	LWLockRelease(FdwXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * Built-in function to remove prepared foreign transaction entries without
+ * resolving them. The function gives a way to forget about such prepared
+ * transactions in case
+ * 1. The foreign server where it is prepared is no longer available
+ * 2. The user which prepared this transaction needs to be dropped
+ * 3. PITR is recovering before a transaction id, which created the prepared
+ *	  foreign transaction
+ * 4. The database containing the entries needs to be dropped
+ *
+ * Or any such conditions in which resolution is no longer possible.
+ *
+ * The function accepts 4 arguments: transaction id, dbid, serverid and userid,
+ * which define the criteria in the same way as search_fdw_xact(). The entries
+ * matching the criteria are removed. The function does not remove an entry
+ * which is locked by some other backend.
+ */
+Datum
+pg_remove_fdw_xacts(PG_FUNCTION_ARGS)
+{
+/* Some #defines only for this function to deal with the arguments */
+#define XID_ARGNUM	0
+#define DBID_ARGNUM 1
+#define SRVID_ARGNUM 2
+#define USRID_ARGNUM 3
+
+	TransactionId xid;
+	Oid			dbid;
+	Oid			serverid;
+	Oid			userid;
+	List	   *entries_to_remove;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to remove foreign transactions"))));
+
+	xid = PG_ARGISNULL(XID_ARGNUM) ? InvalidTransactionId :
+		DatumGetTransactionId(PG_GETARG_DATUM(XID_ARGNUM));
+	dbid = PG_ARGISNULL(DBID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(DBID_ARGNUM);
+	serverid = PG_ARGISNULL(SRVID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(SRVID_ARGNUM);
+	userid = PG_ARGISNULL(USRID_ARGNUM) ? InvalidOid :
+		PG_GETARG_OID(USRID_ARGNUM);
+
+	search_fdw_xact(xid, dbid, serverid, userid, &entries_to_remove);
+
+	while (entries_to_remove)
+	{
+		FdwXact		fdw_xact = linitial(entries_to_remove);
+
+		entries_to_remove = list_delete_first(entries_to_remove);
+
+		remove_fdw_xact(fdw_xact);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Manually resolve foreign transactions on the connected database. This
+ * function returns true if it resolves any foreign transaction, otherwise
+ * it returns false.
+ */
+Datum
+pg_resolve_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	bool    ret;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to resolve foreign transactions"))));
+
+	ret = FdwXactResolveForeignTransactions(MyDatabaseId);
+	ret |= FdwXactResolveDanglingTransactions(MyDatabaseId);
+
+	PG_RETURN_BOOL(ret);
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFdwXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FdwXactOnDiskData *
+ReadFdwXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+	FdwXactOnDiskData *fdw_xact_file_data;
+	struct stat stat;
+	uint32		crc_offset;
+	pg_crc32c	calc_crc;
+	pg_crc32c	file_crc;
+	char	   *buf;
+
+	FdwXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			   errmsg("could not open FDW transaction state file \"%s\": %m",
+					  path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not stat FDW transaction state file \"%s\": %m",
+					  path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FdwXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("incorrect size of FDW transaction state file \"%s\"",
+						path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FdwXactOnDiskData *) buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not read FDW transaction state file \"%s\": %m",
+					  path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+		fdw_xact_file_data->userid != userid ||
+		fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+			(errmsg("removing corrupt foreign transaction state file \"%s\"",
+					path)));
+		/* fd was already closed above; do not close it again */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFdwXacts
+ *
+ * Read the foreign prepared transactions directory for the oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active locally per se. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We cannot know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFdwXacts(TransactionId oldestActiveXid)
+{
+	TransactionId nextXid = ShmemVariableCache->nextXid;
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding to an
+			 * XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+						(errmsg("removing future foreign prepared transaction file \"%s\"",
+								clde->d_name)));
+				RemoveFdwXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+
+/*
+ * RecoverFdwXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFdwXacts(void)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+			FdwXactOnDiskData *fdw_xact_file_data;
+			FdwXact		fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			fdw_xact_file_data = ReadFdwXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction file \"%s\"",
+						  clde->d_name)));
+				RemoveFdwXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+							local_xid, serverid, userid)));
+
+			fdw_xact = get_fdw_xact(local_xid, serverid, userid);
+
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			if (!fdw_xact)
+			{
+				/*
+				 * Add this entry into the table of foreign transactions. The
+				 * status of the transaction is set as preparing, since we do not
+				 * know the exact status right now. Resolver will set it later
+				 * based on the status of local transaction which prepared this
+				 * foreign transaction.
+				 */
+				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										   serverid, userid,
+										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id);
+				fdw_xact->locking_backend = MyBackendId;
+				fdw_xact->status = FDW_XACT_PREPARING;
+			}
+			else
+			{
+				Assert(fdw_xact->inredo);
+				fdw_xact->inredo = false;
+			}
+
+			/* Mark the entry as ready */
+			fdw_xact->valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			pfree(fdw_xact_file_data);
+			LWLockRelease(FdwXactLock);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFdwXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FdwXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove foreign transaction state file \"%s\": %m",
+							path)));
+}
+
+/*
+ * FdwXactRedoAdd
+ *
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdw_xact entry in shared memory FdwXactData structure.
+ */
+void
+FdwXactRedoAdd(XLogReaderState *record)
+{
+	FdwXactOnDiskData *fdw_xact_data = (FdwXactOnDiskData *) XLogRecGetData(record);
+	FdwXact fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
+							   fdw_xact_data->serverid, fdw_xact_data->userid,
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+	fdw_xact->inredo = true;
+	fdw_xact->valid = true;
+	LWLockRelease(FdwXactLock);
+}
+/*
+ * FdwXactRedoRemove
+ *
+ * Remove the corresponding fdw_xact entry from FdwXactCtl.
+ * Also remove fdw_xact file if a foreign transaction was saved
+ * via an earlier checkpoint.
+ */
+void
+FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	FdwXact	fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+	if (fdw_xact)
+	{
+		/* Now we can clean up any files we already left */
+		Assert(fdw_xact->inredo);
+		remove_fdw_xact(fdw_xact);
+	}
+	else
+	{
+		/*
+		 * Entry could be on disk. Call with giveWarning = false
+		 * since it can be expected during replay.
+		 */
+		RemoveFdwXactFile(xid, serverid, userid, false);
+	}
+}
diff --git a/src/backend/access/transam/fdwxact_resolver.c b/src/backend/access/transam/fdwxact_resolver.c
new file mode 100644
index 0000000..c46bc17
--- /dev/null
+++ b/src/backend/access/transam/fdwxact_resolver.c
@@ -0,0 +1,522 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.c
+ *
+ * PostgreSQL foreign transaction resolver background worker
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/fdwxact_resolver.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/resolver_private.h"
+#include "access/transam.h"
+
+#include "funcapi.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+/* GUC parameters */
+int foreign_xact_resolution_interval;
+int foreign_xact_resolver_timeout = 60 * 1000;
+
+FdwXactRslvCtlData *FdwXactRslvCtl;
+
+static long FdwXactRslvComputeSleepTime(TimestampTz now);
+static void FdwXactRslvProcessForeignTransactions(void);
+
+static void fdwxact_resolver_sighup(SIGNAL_ARGS);
+static void fdwxact_resolver_onexit(int code, Datum arg);
+
+/* Flags set by signal handlers */
+static volatile sig_atomic_t got_SIGHUP = false;
+
+/* Report shared memory space needed by FdwXactResolverShmemInit */
+Size
+FdwXactResolverShmemSize(void)
+{
+	Size		size = 0;
+
+	size = add_size(size, mul_size(max_foreign_xact_resolvers,
+								   sizeof(FdwXactResolver)));
+
+	return size;
+}
+
+/*
+ * Allocate and initialize foreign transaction resolver shared
+ * memory.
+ */
+void
+FdwXactResolverShmemInit(void)
+{
+	bool found;
+
+	FdwXactRslvCtl = ShmemInitStruct("Foreign transactions resolvers",
+									 FdwXactResolverShmemSize(),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		int	slot;
+
+		/* First time through, so initialize */
+		MemSet(FdwXactRslvCtl, 0, FdwXactResolverShmemSize());
+
+		SHMQueueInit(&(FdwXactRslvCtl->FdwXactQueue));
+
+		for (slot = 0; slot < max_foreign_xact_resolvers; slot++)
+		{
+			FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[slot];
+
+			SpinLockInit(&(resolver->mutex));
+		}
+	}
+}
+
+/*
+ * Clean up foreign transaction resolver info.
+ */
+static void
+fdwxact_resolver_onexit(int code, Datum arg)
+{
+	MyFdwXactResolver->pid = InvalidPid;
+	MyFdwXactResolver->in_use = false;
+}
+
+/*
+ * Attach to a slot.
+ */
+void
+fdwxact_resolver_attach(int slot)
+{
+	Assert(slot >= 0 && slot < max_foreign_xact_resolvers);
+
+	/* Block concurrent access */
+	LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+	MyFdwXactResolver = &FdwXactRslvCtl->resolvers[slot];
+
+	SpinLockAcquire(&(MyFdwXactResolver->mutex));
+	MyFdwXactResolver->pid = MyProcPid;
+	SpinLockRelease(&(MyFdwXactResolver->mutex));
+
+	MyFdwXactResolver->latch = &MyProc->procLatch;
+
+	if (!MyFdwXactResolver->in_use)
+	{
+		LWLockRelease(FdwXactResolverLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("foreign transaction resolver slot %d is empty, cannot attach",
+						slot)));
+	}
+
+	before_shmem_exit(fdwxact_resolver_onexit, (Datum) 0);
+
+	LWLockRelease(FdwXactResolverLock);
+}
+
+/* Set flag to reload configuration at next convenient time */
+static void
+fdwxact_resolver_sighup(SIGNAL_ARGS)
+{
+	int		save_errno = errno;
+
+	got_SIGHUP = true;
+
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/* Foreign transaction resolver entry point */
+void
+FdwXactRslvMain(Datum main_arg)
+{
+	int slot = DatumGetInt32(main_arg);
+
+	/* Attach to a slot */
+	fdwxact_resolver_attach(slot);
+
+	elog(DEBUG1, "foreign transaction resolver for database %u started",
+		 MyFdwXactResolver->dbid);
+
+	/* Establish signal handlers */
+	pqsignal(SIGHUP, fdwxact_resolver_sighup);
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize stats to a sanish value */
+	MyFdwXactResolver->last_resolution_time = GetCurrentTimestamp();
+
+	/* Establish connection to nailed catalogs */
+	BackgroundWorkerInitializeConnectionByOid(MyFdwXactResolver->dbid, InvalidOid);
+
+	for (;;)
+	{
+		int			rc;
+		TimestampTz	now;
+		long		sleep_time;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/* Resolve pending transactions, if there are any */
+		FdwXactRslvProcessForeignTransactions();
+
+		now = GetCurrentTimestamp();
+
+		sleep_time = FdwXactRslvComputeSleepTime(now);
+
+		if (sleep_time < 0)
+		{
+			/*
+			 * We reached the timeout here. We can exit if there are no
+			 * pending foreign transactions in the shmem queue. Check for them
+			 * and then shut down while holding FdwXactResolverLaunchLock.
+			 */
+			if (!fdw_xact_exists(InvalidTransactionId, MyDatabaseId, InvalidOid,
+								 InvalidOid))
+			{
+				/*
+				 * There are no more transactions we need to resolve, so
+				 * turn off my slot while holding the lock so that concurrent
+				 * backends cannot register additional entries.
+				 */
+				SpinLockAcquire(&(MyFdwXactResolver->mutex));
+				MyFdwXactResolver->in_use = false;
+				SpinLockRelease(&(MyFdwXactResolver->mutex));
+
+				ereport(LOG,
+						(errmsg("foreign transaction resolver for database \"%u\" will stop because of the timeout",
+								MyFdwXactResolver->dbid)));
+
+				proc_exit(0);
+			}
+
+			/*
+			 * If a new foreign transaction arrived while we were checking, we
+			 * handle it before exiting. Since we know there is a pending foreign
+			 * transaction, we don't want to sleep.
+			 */
+			sleep_time = 0;
+		}
+
+		rc = WaitLatch(MyLatch,
+					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+					   sleep_time,
+					   WAIT_EVENT_FDW_XACT_RESOLVER_MAIN);
+
+		if (rc & WL_POSTMASTER_DEATH)
+			proc_exit(1);
+	}
+}
+
+/*
+ * Process all foreign transactions on the database we are connected to. If we
+ * succeed in resolving any, we update the last resolution time. We return once
+ * a cycle resolves no foreign transactions.
+ */
+static void
+FdwXactRslvProcessForeignTransactions(void)
+{
+	/*
+	 * Loop until there are no more foreign transactions we need to resolve.
+	 */
+	for (;;)
+	{
+		int	n_resolved;
+		int	n_resolved_dangling;
+
+		StartTransactionCommand();
+
+		/* Resolve all foreign transactions associated with xid */
+		n_resolved = FdwXactResolveForeignTransactions(MyFdwXactResolver->dbid);
+
+		/* Resolve dangling transactions, if there are any */
+		n_resolved_dangling = FdwXactResolveDanglingTransactions(MyFdwXactResolver->dbid);
+
+		CommitTransactionCommand();
+
+		if (n_resolved > 0)
+		{
+			/* Update my state */
+			SpinLockAcquire(&(MyFdwXactResolver->mutex));
+			MyFdwXactResolver->last_resolution_time = GetCurrentTimestamp();
+			SpinLockRelease(&(MyFdwXactResolver->mutex));
+		}
+
+		/* Return if we didn't resolve any foreign transactions */
+		if (n_resolved == 0 && n_resolved_dangling == 0)
+			return;
+	}
+}
+
+/*
+ * Compute how long we should sleep until the next cycle. Return the sleep time
+ * in milliseconds; -1 means that we reached the timeout and should exit.
+ */
+static long
+FdwXactRslvComputeSleepTime(TimestampTz now)
+{
+	static TimestampTz	wakeuptime = 0;
+	long	sleeptime;
+	long	sec_to_timeout;
+	int		microsec_to_timeout;
+
+	if (foreign_xact_resolver_timeout > 0)
+	{
+		TimestampTz timeout;
+
+		timeout = TimestampTzPlusMilliseconds(MyFdwXactResolver->last_resolution_time,
+											  foreign_xact_resolver_timeout);
+
+		/* If we reached the timeout, exit */
+		if (now >= timeout)
+			return -1;
+	}
+
+	if (now >= wakeuptime)
+		wakeuptime = TimestampTzPlusMilliseconds(now,
+												 foreign_xact_resolution_interval * 1000);
+
+	/* Compute relative time until wakeup. */
+	TimestampDifference(now, wakeuptime,
+						&sec_to_timeout, &microsec_to_timeout);
+
+	sleeptime = sec_to_timeout * 1000 + microsec_to_timeout / 1000;
+
+	return sleeptime;
+}
+
+/*
+ * Launch a new foreign transaction resolver worker if not launched yet.
+ * A foreign transaction resolver worker is responsible for the resolution
+ * of the foreign transactions registered on one database. So if a resolver
+ * worker has already been launched by another backend, we don't need to
+ * launch a new one.
+ */
+void
+fdwxact_maybe_launch_resolver(void)
+{
+	FdwXactResolver *resolver = NULL;
+	BackgroundWorker bgw;
+	BackgroundWorkerHandle *bgw_handle;
+	int i;
+	int	slot;
+	bool	found = false;
+
+	LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver *r = &FdwXactRslvCtl->resolvers[i];
+
+		/*
+		 * Found a running resolver that is responsible for the
+		 * database "dbid".
+		 */
+		if (r->in_use && r->pid != InvalidPid && r->dbid == MyDatabaseId)
+		{
+			found = true;
+			resolver = r;
+			break;
+		}
+	}
+
+	/*
+	 * If we found the resolver for my database, we don't need to launch a new one.
+	 * Just wake it up.
+	 */
+	if (found)
+	{
+		SetLatch(resolver->latch);
+		LWLockRelease(FdwXactResolverLock);
+
+		elog(DEBUG1, "found a running foreign transaction resolver process for database %u",
+			 MyDatabaseId);
+
+		return;
+	}
+
+	elog(DEBUG1, "starting foreign transaction resolver for database ID %u", MyDatabaseId);
+
+	/* Find unused worker slot */
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver *r = &FdwXactRslvCtl->resolvers[i];
+
+		/* Found an unused worker slot */
+		if (!r->in_use)
+		{
+			resolver = r;
+			slot = i;
+			break;
+		}
+	}
+
+	/*
+	 * If there are no free worker slots, inform the user about it before
+	 * exiting.
+	 */
+	if (resolver == NULL)
+	{
+		LWLockRelease(FdwXactResolverLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of foreign transaction resolver slots"),
+				 errhint("You might need to increase max_foreign_transaction_resolvers.")));
+
+		return;
+	}
+
+	/* Prepare the resolver slot. It's in use but pid is still invalid */
+	resolver->dbid = MyDatabaseId;
+	resolver->in_use = true;
+	resolver->pid = InvalidPid;
+	TIMESTAMP_NOBEGIN(resolver->last_resolution_time);
+
+	LWLockRelease(FdwXactResolverLock);
+
+	/* Register the new dynamic worker */
+	memset(&bgw, 0, sizeof(bgw));
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactRslvMain");
+	snprintf(bgw.bgw_name, BGW_MAXLEN,
+			 "foreign transaction resolver for database %u", MyDatabaseId);
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "foreign transaction resolver");
+	bgw.bgw_restart_time = BGW_NEVER_RESTART;
+	bgw.bgw_main_arg = Int32GetDatum(slot);
+	bgw.bgw_notify_pid = MyProcPid;
+
+	if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+	{
+		/* Failed to launch, cleanup the worker slot */
+		SpinLockAcquire(&(resolver->mutex));
+		resolver->in_use = false;
+		SpinLockRelease(&(resolver->mutex));
+
+		ereport(WARNING,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of background worker slots"),
+				 errhint("You might need to increase max_worker_processes.")));
+	}
+
+	/*
+	 * We don't need to wait until it attaches here because we're going to wait
+	 * until all foreign transactions are resolved.
+	 */
+}
+
+/*
+ * Returns activity of foreign transaction resolvers, including the pid, the
+ * database OID and the last resolution time of each resolver.
+ */
+Datum
+pg_stat_get_fdwxact_resolver(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_FDWXACT_RESOLVERS_COLS 3
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	int i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver	*resolver = &FdwXactRslvCtl->resolvers[i];
+		pid_t	pid;
+		Oid		dbid;
+		TimestampTz last_resolution_time;
+		Datum		values[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+		bool		nulls[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+
+
+		SpinLockAcquire(&(resolver->mutex));
+		if (resolver->pid == 0)
+		{
+			SpinLockRelease(&(resolver->mutex));
+			continue;
+		}
+
+		pid = resolver->pid;
+		dbid = resolver->dbid;
+		last_resolution_time = resolver->last_resolution_time;
+		SpinLockRelease(&(resolver->mutex));
+
+		memset(nulls, 0, sizeof(nulls));
+		/* pid */
+		values[0] = Int32GetDatum(pid);
+
+		/* dbid */
+		values[1] = ObjectIdGetDatum(dbid);
+
+		/* last_resolution_time */
+		if (last_resolution_time == 0)
+			nulls[2] = true;
+		else
+			values[2] = TimestampTzGetDatum(last_resolution_time);
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/* clean up and return the tuplestore */
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b360b1 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 321da9f..d736098 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -845,6 +846,35 @@ TwoPhaseGetGXact(TransactionId xid)
 }
 
 /*
+ * TwoPhaseExists
+ *		Return true if there is a prepared transaction specified by XID
+ */
+bool
+TwoPhaseExists(TransactionId xid)
+{
+	int		i;
+	bool	found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGXACT	*pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+		if (pgxact->xid == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return found;
+}
+
+/*
  * TwoPhaseGetDummyProc
  *		Get the dummy backend ID for prepared transaction specified by XID
  *
@@ -2240,6 +2270,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	 * in the procarray and continue to hold locks.
 	 */
 	SyncRepWaitForLSN(recptr, true);
+
+	/*
+	 * Wait for foreign transaction prepared as part of this prepared
+	 * transaction to be committed.
+	 */
+	FdwXactWaitForResolution(xid, true);
 }
 
 /*
@@ -2298,6 +2334,12 @@ RecordTransactionAbortPrepared(TransactionId xid,
 	 * in the procarray and continue to hold locks.
 	 */
 	SyncRepWaitForLSN(recptr, false);
+
+	/*
+	 * Wait for foreign transaction prepared as part of this prepared
+	 * transaction to be rolled back.
+	 */
+	FdwXactWaitForResolution(xid, false);
 }
 
 /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e795a2f..588dba8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1132,6 +1133,7 @@ RecordTransactionCommit(void)
 	SharedInvalidationMessage *invalMessages = NULL;
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
+	bool		need_twophase;
 
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
@@ -1140,6 +1142,7 @@ RecordTransactionCommit(void)
 		nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
 													 &RelcacheInitFileInval);
 	wrote_xlog = (XactLastRecEnd != 0);
+	need_twophase = TwoPhaseCommitRequired();
 
 	/*
 	 * If we haven't been assigned an XID yet, we neither can, nor do we want
@@ -1178,12 +1181,13 @@ RecordTransactionCommit(void)
 		}
 
 		/*
-		 * If we didn't create XLOG entries, we're done here; otherwise we
-		 * should trigger flushing those entries the same as a commit record
+		 * If we didn't create XLOG entries and the transaction does not need
+		 * to be committed using two-phase commit, we're done here; otherwise
+		 * we should trigger flushing those entries the same as a commit record
 		 * would.  This will primarily happen for HOT pruning and the like; we
 		 * want these to be flushed to disk in due time.
 		 */
-		if (!wrote_xlog)
+		if (!wrote_xlog && !need_twophase)
 			goto cleanup;
 	}
 	else
@@ -1341,6 +1345,14 @@ RecordTransactionCommit(void)
 	if (wrote_xlog && markXidCommitted)
 		SyncRepWaitForLSN(XactLastRecEnd, true);
 
+	/*
+	 * Wait for prepared foreign transaction to be resolved, if required.
+	 * We only want to wait if we prepared foreign transaction in this
+	 * transaction.
+	 */
+	if (need_twophase && markXidCommitted)
+		FdwXactWaitForResolution(xid, true);
+
 	/* remember end of last commit record */
 	XactLastCommitEnd = XactLastRecEnd;
 
@@ -1978,6 +1990,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FdwXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2133,6 +2148,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_FdwXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2222,6 +2238,8 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	AtPrepare_FdwXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2410,6 +2428,7 @@ PrepareTransaction(void)
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
+	AtEOXact_FdwXacts(true);
 	/* don't call AtEOXact_PgStat here; we fixed pgstat state above */
 	AtEOXact_Snapshot(true, true);
 	pgstat_report_xact_timestamp(0);
@@ -2616,6 +2635,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_FdwXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 3e9a12d..43c4ea4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5157,6 +5158,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6244,6 +6246,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
@@ -6931,8 +6936,12 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
+
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7557,6 +7566,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7743,6 +7753,8 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	RecoverFdwXacts();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -9048,6 +9060,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	CheckPointFdwXacts(checkPointRedo);
 }
 
 /*
@@ -9484,7 +9497,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9516,6 +9530,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9713,6 +9728,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9902,6 +9918,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 394aea8..768c7c8 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_prepared_fdw_xacts AS
+       SELECT * FROM pg_prepared_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
@@ -769,6 +772,14 @@ CREATE VIEW pg_stat_subscription AS
             LEFT JOIN pg_stat_get_subscription(NULL) st
                       ON (st.subid = su.oid);
 
+CREATE VIEW pg_stat_fdwxact_resolvers AS
+    SELECT
+            r.pid,
+            r.dbid,
+            r.last_resolution_time
+    FROM pg_stat_get_fdwxact_resolver() r
+    WHERE r.pid IS NOT NULL;
+
 CREATE VIEW pg_stat_ssl AS
     SELECT
             S.pid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 9ad9915..b0e1c8d 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1093,6 +1094,14 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1403,6 +1412,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+						srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index 88806eb..627978e 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,6 +16,7 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "access/fdwxact_resolver.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -129,6 +130,9 @@ static const struct
 	},
 	{
 		"ApplyWorkerMain", ApplyWorkerMain
+	},
+	{
+		"FdwXactRslvMain", FdwXactRslvMain
 	}
 };
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index b502c1b..5cb1ae6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3673,6 +3673,12 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_FDW_XACT_RESOLUTION:
+			event_name = "FdwXactResolution";
+			break;
+		case WAIT_EVENT_FDW_XACT_RESOLVER_MAIN:
+			event_name = "FdwXactResolver";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 17c7f7e..84b4631 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdwxact_resolver.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/ip.h"
@@ -899,6 +900,10 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
 
+	if (max_prepared_foreign_xacts > 0 && max_foreign_xact_resolvers == 0)
+		ereport(ERROR,
+				(errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires max_foreign_transaction_resolvers > 0")));
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 486fd0c..27716b5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -150,6 +150,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_FDW_XACT_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2d1ed14..ac2d7b8 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,8 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +152,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FdwXactShmemSize());
+		size = add_size(size, FdwXactResolverShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +274,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FdwXactShmemInit();
+	FdwXactResolverShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..a42d06e 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,5 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+FdwXactLock					46
+FdwXactResolverLock			47
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 5f6727d..3d09978 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -35,6 +35,7 @@
 #include <unistd.h>
 #include <sys/time.h>
 
+#include "access/fdwxact.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
@@ -397,6 +398,10 @@ InitProcess(void)
 	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
 	SHMQueueElemInit(&(MyProc->syncRepLinks));
 
+	/* initialize fields for fdw xact */
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+	SHMQueueElemInit(&(MyProc->fdwXactLinks));
+
 	/* Initialize fields for group XID clearing. */
 	MyProc->procArrayGroupMember = false;
 	MyProc->procArrayGroupMemberXid = InvalidTransactionId;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e32901d..2ea4b1e 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2091,6 +2092,51 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"foreign_transaction_resolver_timeout", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Sets the maximum time to wait for foreign transaction resolution."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&foreign_xact_resolver_timeout,
+		60 * 1000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"max_foreign_transaction_resolvers", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Maximum number of foreign transaction resolution processes."),
+			NULL
+		},
+		&max_foreign_xact_resolvers,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"foreign_transaction_resolution_interval", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+		 gettext_noop("Sets the maximum interval between resolving foreign transactions."),
+		 NULL,
+		 GUC_UNIT_S
+		},
+		&foreign_xact_resolution_interval,
+		10, 0, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 69f40f0..c1c9868 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -117,6 +117,8 @@
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
 #work_mem = 4MB				# min 64kB
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 214dc71..af2c627 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ddc850d..1e8653e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,6 +208,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index cc73b7d..5b489c0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 9f93385..a923277 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -672,6 +672,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -893,6 +894,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..2e88496 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -11,6 +11,7 @@
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/generic_xlog.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
diff --git a/src/include/access/fdwxact.h b/src/include/access/fdwxact.h
new file mode 100644
index 0000000..b236d75
--- /dev/null
+++ b/src/include/access/fdwxact.h
@@ -0,0 +1,153 @@
+/*
+ * fdwxact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "access/xlogreader.h"
+#include "foreign/foreign.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "storage/backendid.h"
+#include "storage/shmem.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+#define	FDW_XACT_NOT_WAITING		0
+#define	FDW_XACT_WAITING			1
+#define	FDW_XACT_WAIT_COMPLETE		2
+
+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
+#define FdwXactEnabled() (max_prepared_foreign_xacts > 0)
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FdwXactData *FdwXact;
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,		/* foreign prepared transaction is to
+										 * be committed */
+	FDW_XACT_ABORTING_PREPARED, /* foreign prepared transaction is to be
+								 * aborted */
+	FDW_XACT_RESOLVED
+} FdwXactStatus;
+
+typedef struct FdwXactData
+{
+	FdwXact		fx_free_next;	/* Next free FdwXact entry */
+	FdwXact		fx_next;		/* Next FdwXact entry associated with the same
+								   transaction */
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;	/* XID of local transaction */
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping id for connection key */
+	FdwXactStatus status;		/* The state of the foreign
+								 * transaction. This doubles as the
+								 * action to be taken on this entry. */
+
+	/*
+	 * Note that we need to keep track of two LSNs for each FdwXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FdwXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior to
+	 * commit.
+	 */
+	XLogRecPtr	fdw_xact_start_lsn;		/* XLOG offset where this entry's record starts */
+	XLogRecPtr	fdw_xact_end_lsn;		/* XLOG offset where this entry's record ends */
+
+	bool		valid; /* Has the entry been completed and written to file? */
+	BackendId	locking_backend;	/* Backend working on this entry */
+	bool		ondisk;			/* TRUE if prepare state file is on disk */
+	bool		inredo;			/* TRUE if entry was added via xlog_redo */
+	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction identifier */
+}	FdwXactData;
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FdwXactData structs */
+	FdwXact		freeFdwXacts;
+
+	/* Number of valid foreign transaction entries */
+	int			numFdwXacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FdwXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];		/* Variable length array */
+}	FdwXactCtlData;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+ FdwXactCtlData *FdwXactCtl;
+
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;
+	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+}	FdwXactOnDiskData;
+
+typedef struct
+{
+	TransactionId xid;
+	Oid			serverid;
+	Oid			userid;
+	Oid			dbid;
+}	FdwRemoveXlogRec;
+
+/* GUC parameters */
+extern int	max_prepared_foreign_xacts;
+extern int	max_foreign_xact_resolvers;
+extern int	foreign_xact_resolution_interval;
+extern int	foreign_xact_resolver_timeout;
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+extern Size FdwXactShmemSize(void);
+extern void FdwXactShmemInit(void);
+extern void RecoverFdwXacts(void);
+extern void FdwXactRegisterForeignServer(Oid serverid, Oid userid, bool can_prepare,
+										 bool modify);
+extern TransactionId PrescanFdwXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void AtEOXact_FdwXacts(bool is_commit);
+extern void AtPrepare_FdwXacts(void);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+				Oid userid);
+extern void CheckPointFdwXacts(XLogRecPtr redo_horizon);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FdwXacts(void);
+extern void FdwXactRedoAdd(XLogReaderState *record);
+extern void FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFdwXactRecreateFiles(XLogRecPtr redo_horizon);
+extern void FdwXactWaitForResolution(TransactionId wait_xid, bool commit);
+extern int FdwXactResolveForeignTransactions(Oid dbid);
+extern int FdwXactResolveDanglingTransactions(Oid dbid);
+extern bool TwoPhaseCommitRequired(void);
+
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+
+#endif   /* FDW_XACT_H */
diff --git a/src/include/access/fdwxact_resolver.h b/src/include/access/fdwxact_resolver.h
new file mode 100644
index 0000000..7ce1551
--- /dev/null
+++ b/src/include/access/fdwxact_resolver.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.h
+ *	  PostgreSQL foreign transaction resolver definitions
+ *
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_resolver.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FDWXACT_RESOLVER_H
+#define FDWXACT_RESOLVER_H
+
+#include "access/fdwxact.h"
+
+extern void FdwXactRslvMain(Datum main_arg);
+extern Size FdwXactResolverShmemSize(void);
+extern void FdwXactResolverShmemInit(void);
+
+extern void fdwxact_resolver_attach(int slot);
+extern void fdwxact_maybe_launch_resolver(void);
+
+extern int foreign_xact_resolver_timeout;
+
+#endif		/* FDWXACT_RESOLVER_H */
diff --git a/src/include/access/resolver_private.h b/src/include/access/resolver_private.h
new file mode 100644
index 0000000..e649cbd
--- /dev/null
+++ b/src/include/access/resolver_private.h
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * resolver_private.h
+ *	  Private definitions from access/transam/fdwxact/resolver.c
+ *
+ * Portions Copyright (c) 2017, PostgreSQL Global Development Group
+ *
+ * src/include/access/resolver_private.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef _RESOLVER_PRIVATE_H
+#define _RESOLVER_PRIVATE_H
+
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/timestamp.h"
+
+/*
+ * Each foreign transaction resolver has a FdwXactResolver struct in
+ * shared memory.  This struct is protected by FdwXactResolverLaunchLock.
+ */
+typedef struct FdwXactResolver
+{
+	pid_t	pid;	/* this resolver's PID, or 0 if not active */
+	Oid		dbid;	/* database oid */
+
+	/* Indicates if this slot is used or free */
+	bool	in_use;
+
+	/* Stats */
+	TimestampTz	last_resolution_time;
+
+	/* Protect shared variables shown above */
+	slock_t	mutex;
+
+	/*
+	 * Pointer to the resolver's latch. Used by backends to wake up this
+	 * resolver when it has work to do. NULL if the resolver isn't active.
+	 */
+	Latch	*latch;
+} FdwXactResolver;
+
+/* There is one FdwXactRslvCtlData struct for the whole database cluster */
+typedef struct FdwXactRslvCtlData
+{
+	/*
+	 * Foreign transaction resolution queue. Protected by FdwXactLock.
+	 */
+	SHM_QUEUE	FdwXactQueue;
+
+	FdwXactResolver resolvers[FLEXIBLE_ARRAY_MEMBER];
+} FdwXactRslvCtlData;
+
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+extern FdwXactResolver *MyFdwXactResolver;
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+
+#endif	/* _RESOLVER_PRIVATE_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 2f43c19..62702de 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index f5fbbea..0359e9c 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -35,6 +35,7 @@ extern void PostPrepare_Twophase(void);
 
 extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid);
 extern BackendId TwoPhaseGetDummyBackendId(TransactionId xid);
+extern bool	TwoPhaseExists(TransactionId xid);
 
 extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 				TimestampTz prepared_at,
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 7805c3c..5c1b839 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -227,6 +227,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 9e9e014..a521dc3 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -178,6 +178,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 830bab3..9027506 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2887,6 +2887,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f
 DESCR("statistics: information about WAL receiver");
 DATA(insert OID = 6118 (  pg_stat_get_subscription	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}" "{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}" _null_ _null_ pg_stat_get_subscription _null_ _null_ _null_ ));
 DESCR("statistics: information about subscription");
+DATA(insert OID = 4101 (  pg_stat_get_fdwxact_resolver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{26,26,26,1184}" "{o,o,o,o}" "{pid,dbid,n_entries,last_resolution_time}" _null_ _null_ pg_stat_get_fdwxact_resolver _null_ _null_ _null_ ));
+DESCR("statistics: information about foreign transaction resolver");
 DATA(insert OID = 2026 (  pg_backend_pid				PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
 DESCR("statistics: current backend PID");
 DATA(insert OID = 1937 (  pg_stat_get_backend_pid		PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_ ));
@@ -4204,6 +4206,15 @@ DESCR("get the available time zone names");
 DATA(insert OID = 2730 (  pg_get_triggerdef		PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_get_triggerdef_ext _null_ _null_ _null_ ));
 DESCR("trigger description with pretty-print option");
 
+/* foreign transactions */
+DATA(insert OID = 4099 ( pg_prepared_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid, transaction,serverid,userid,status,identifier}" _null_ _null_ pg_prepared_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4100 ( pg_remove_fdw_xacts PGNSP PGUID 12 1 0 0 0 f f f f f f v u 4 0 2278 "28 26 26 26" _null_ _null_ "{transaction,dbid,serverid,userid}" _null_ _null_ pg_remove_fdw_xacts _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
+DATA(insert OID = 4126 ( pg_resolve_fdw_xacts	PGNSP PGUID 12 1 0 0 0 f f f f t f v s 0 0 16 "" _null_ _null_ _null_ _null_ _null_ pg_resolve_fdw_xacts _null_ _null_ _null_ ));
+DESCR("resolve foreign transactions");
+
+
 /* asynchronous notifications */
 DATA(insert OID = 3035 (  pg_listening_channels PGNSP PGUID 12 1 10 0 0 f f f f t t s r 0 0 25 "" _null_ _null_ _null_ _null_ _null_ pg_listening_channels _null_ _null_ _null_ ));
 DESCR("get the channels that the current backend listens to");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 04e43cc..bbadc2e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -162,6 +162,18 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
 
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+										int *prep_info_len);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, const char *prep_id);
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															const char *prep_id);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -226,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for distributed transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 58f3a19..e954e6f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -831,7 +831,9 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_FDW_XACT_RESOLUTION,
+	WAIT_EVENT_FDW_XACT_RESOLVER_MAIN
 } WaitEventIPC;
 
 /* ----------
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 1d37050..034fcb4 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -150,6 +150,16 @@ struct PGPROC
 	SHM_QUEUE	syncRepLinks;	/* list link if process is in syncrep queue */
 
 	/*
+	 * Info to allow us to wait for foreign transaction to be resolved, if
+	 * needed.
+	 */
+	TransactionId	waitXid;	/* waiting for foreign transaction involved with
+								 * this transaction id to be resolved */
+	int			fdwXactState;	/* wait state for foreign transaction
+								 * resolution */
+	SHM_QUEUE	fdwXactLinks;	/* list link if process is in queue */
+
+	/*
 	 * All PROCLOCK objects for locks held or awaited by this backend are
 	 * linked into one of these lists, according to the partition number of
 	 * their lock.
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index e31accf..09694ed 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/postgres_fdw
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/014_fdwxact.pl b/src/test/recovery/t/014_fdwxact.pl
new file mode 100644
index 0000000..28c9de9
--- /dev/null
+++ b/src/test/recovery/t/014_fdwxact.pl
@@ -0,0 +1,176 @@
+# Tests for transactions involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
+max_foreign_transaction_resolvers = 2
+foreign_transaction_resolver_timeout = 0
+foreign_transaction_resolution_interval = 5s
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+							   has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs2->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign servers on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE EXTENSION postgres_fdw
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+));
+
+# Create user mapping on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+));
+
+# Create tables on foreign nodes and import them to the master node
+$node_fs1->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 (c int);
+));
+$node_fs2->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 (c int);
+));
+$node_master->safe_psql('postgres', qq(
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE l_table (c int);
+));
+
+# Switch to synchronous replication
+$node_master->safe_psql('postgres', qq(
+ALTER SYSTEM SET synchronous_standby_names ='*';
+));
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving multiple foreign servers and shut down
+# the master node. Check that we can commit and roll back the foreign
+# transactions after normal recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->stop;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after recovery');
+
+#
+# Prepare two transactions involving multiple foreign servers and shut down
+# the master node immediately. Check that we can commit and roll back the
+# foreign transactions after crash recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the crash recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after crash recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after crash recovery');
+
+#
+# Commit a transaction involving foreign servers and shut down the master node
+# immediately, before a checkpoint. Check that WAL replay cleans up its shared
+# memory state and releases locks while replaying the transaction commit.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (3);
+INSERT INTO t2 VALUES (3);
+COMMIT;
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', qq(
+SELECT count(*) FROM pg_prepared_fdw_xacts;
+));
+is($result, 0, "Cleanup of shared memory state for foreign transactions");
+
+#
+# Check that the standby node can process prepared foreign transactions
+# after promotion.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (4);
+INSERT INTO t2 VALUES (4);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (5);
+INSERT INTO t2 VALUES (5);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', qq(COMMIT PREPARED 'gxid1';));
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', qq(ROLLBACK PREPARED 'gxid2';));
+is($result, 0, 'Rollback foreign transaction after promotion');
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index f1c1b44..8aee1e0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1413,6 +1413,13 @@ pg_policies| SELECT n.nspname AS schemaname,
    FROM ((pg_policy pol
      JOIN pg_class c ON ((c.oid = pol.polrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));
+pg_prepared_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_prepared_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_prepared_statements| SELECT p.name,
     p.statement,
     p.prepare_time,
@@ -1819,6 +1826,11 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
     pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
     pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
    FROM pg_database d;
+pg_stat_fdwxact_resolvers| SELECT r.pid,
+    r.dbid,
+    r.last_resolution_time
+   FROM pg_stat_get_fdwxact_resolver() r(pid, dbid, n_entries, last_resolution_time)
+  WHERE (r.pid IS NOT NULL);
 pg_stat_progress_vacuum| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index e7ea3ae..70c0ea9 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2292,9 +2292,12 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions and
+		 * max_foreign_transaction_resolvers to enable testing of transactions
+		 * involving multiple foreign servers. (Note: to reduce the probability
+		 * of unexpected shmmax failures, don't set max_prepared_transactions
+		 * any higher than actually needed by the prepared_xacts regression
+		 * test.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2309,7 +2312,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
+		fputs("max_foreign_transaction_resolvers = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{
-- 
1.7.1
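
A note on FDW_XACT_ID_LEN in fdwxact.h above: the macro adds up to 2 + 1 + 8 + 1 + 8 + 1 + 8 = 29, which reads like a two-character prefix followed by three eight-digit fields, each preceded by a one-character separator. The sketch below only illustrates that arithmetic. The "px" prefix, the "_" separator, the hexadecimal formatting, and the function name toy_format_fdw_xact_id are assumptions made for illustration and are not taken from this part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same arithmetic as the macro in fdwxact.h: prefix, then three 8-digit fields. */
#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

/*
 * Hypothetical formatter: writes "px_<rnd>_<serverid>_<userid>" with each
 * field as 8 upper-case hex digits.  Returns the number of characters
 * written, not counting the terminating NUL.
 */
int
toy_format_fdw_xact_id(char *buf, size_t buflen,
					   unsigned int rnd, unsigned int serverid,
					   unsigned int userid)
{
	int			n;

	n = snprintf(buf, buflen, "px_%08X_%08X_%08X", rnd, serverid, userid);

	/* The caller must supply room for all characters plus the NUL. */
	assert(n >= 0 && (size_t) n < buflen);
	return n;
}
```

With a buffer of FDW_XACT_ID_LEN + 1 bytes, a call such as toy_format_fdw_xact_id(buf, sizeof(buf), 42, 16385, 10) fills exactly 29 characters. An identifier that uses every field therefore needs 30 bytes when stored as a C string, whereas fdw_xact_id[FDW_XACT_ID_LEN] holds 29, so it may be worth checking whether the code that fills that array relies on NUL termination.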

Attachment: 0003-postgres_fdw-supports-atomic-distributed-transaction_v14.patch (application/octet-stream)

From 110b8b9d657601ff3ff47e1bc9fd29629fc24a12 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 26 Dec 2017 19:52:05 +0900
Subject: [PATCH 3/3] postgres_fdw supports atomic distributed transaction commit.

---
 contrib/postgres_fdw/connection.c              |  566 +++++++++++++++---------
 contrib/postgres_fdw/expected/postgres_fdw.out |  343 ++++++++++++++-
 contrib/postgres_fdw/option.c                  |    5 +-
 contrib/postgres_fdw/postgres_fdw.c            |   93 ++++-
 contrib/postgres_fdw/postgres_fdw.h            |   14 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  133 ++++++
 doc/src/sgml/postgres-fdw.sgml                 |   37 ++
 7 files changed, 952 insertions(+), 239 deletions(-)
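
For readers following the callback split introduced in fdwapi.h by the previous patch (PrepareForeignTransaction, then ResolvePreparedForeignTransaction with is_commit), the control flow can be pictured with a toy coordinator. This is a minimal, self-contained sketch under stated assumptions: the Participant type, the toy_* functions, and the prepare_fails flag are hypothetical and do not appear in the patch, and the sketch omits WAL logging, shared memory entries, and the resolver process entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical state of one foreign server taking part in a transaction. */
typedef enum
{
	PART_ACTIVE,				/* remote transaction open, not prepared */
	PART_PREPARED,				/* phase one done on this server */
	PART_COMMITTED,
	PART_ABORTED
} ParticipantState;

typedef struct Participant
{
	const char *name;
	bool		prepare_fails;	/* simulate a failing PREPARE TRANSACTION */
	ParticipantState state;
} Participant;

/* Phase one: analogous in spirit to a PrepareForeignTransaction callback. */
static bool
toy_prepare(Participant *p)
{
	if (p->prepare_fails)
		return false;
	p->state = PART_PREPARED;
	return true;
}

/* Phase two: commit or roll back whatever state the participant reached. */
static void
toy_resolve(Participant *p, bool is_commit)
{
	if (p->state == PART_PREPARED)
		p->state = is_commit ? PART_COMMITTED : PART_ABORTED;
	else if (p->state == PART_ACTIVE)
		p->state = PART_ABORTED;	/* never prepared: plain rollback */
}

/* Returns true only if every participant prepared and was then committed. */
bool
toy_two_phase_commit(Participant *parts, size_t n)
{
	bool		all_prepared = true;
	size_t		i;

	for (i = 0; i < n; i++)
	{
		if (!toy_prepare(&parts[i]))
		{
			all_prepared = false;
			break;
		}
	}

	for (i = 0; i < n; i++)
		toy_resolve(&parts[i], all_prepared);

	return all_prepared;
}
```

The point of the sketch is the ordering: no participant is committed until all have prepared, and a single prepare failure turns phase two into a rollback for every participant, including those already prepared.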

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4fbf043..df7c745 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,9 +14,11 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -73,13 +75,13 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void disconnect_pg_server(ConnCacheEntry *entry);
 static void check_conn_params(const char **keywords, const char **values, UserMapping *user);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
-static void pgfdw_xact_callback(XactEvent event, void *arg);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
@@ -91,24 +93,27 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
 						 bool ignore_errors);
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 						 PGresult **result);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry, bool is_commit);
+static ConnCacheEntry *GetConnectionCacheEntry(Oid umid);
 
-
-/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetExistingConnection(Oid umid)
 {
-	bool		found;
-	ConnCacheEntry *entry;
-	ConnCacheKey key;
+	ConnCacheEntry	*entry;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	Assert(entry->conn != NULL);
+
+	return entry->conn;
+}
+
+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
+{
+	ConnCacheEntry	*entry;
+	ConnCacheKey	key;
+	bool			found;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
@@ -128,7 +133,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
-		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
 									  pgfdw_inval_callback, (Datum) 0);
@@ -136,11 +140,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 									  pgfdw_inval_callback, (Datum) 0);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -155,6 +156,28 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->conn = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
+{
+	ConnCacheEntry *entry;
+
+	/* Get connection cache entry from cache */
+	entry = GetConnectionCacheEntry(user->umid);
+
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
@@ -198,7 +221,16 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 								  ObjectIdGetDatum(user->umid));
 
 		/* Now try to make the connection */
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -207,7 +239,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -217,9 +254,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails and the caller can handle connection failure
+ * (connection_error_ok = true), return NULL; otherwise throw an error.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -265,11 +305,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						 errmsg("could not connect to server \"%s\"",
+								server->servername),
+						 errdetail_internal("%s", connmessage)));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -414,15 +468,24 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer	*server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the foreign server, recording whether it is configured
+		 * for two-phase commit.
+		 */
+		FdwXactRegisterForeignServer(serverid, userid,
+									 server_uses_two_phase_commit(server),
+									 false);
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -644,193 +707,6 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
- */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
-{
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
-
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
-
-	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
-	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
-	{
-		PGresult   *res;
-
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
-
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			bool		abort_cleanup_failure = false;
-
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
-
-			switch (event)
-			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-
-					/*
-					 * If abort cleanup previously failed for this connection,
-					 * we can't issue any more commands against it.
-					 */
-					pgfdw_reject_incomplete_xact_state_change(entry);
-
-					/* Commit all remote transactions during pre-commit */
-					entry->changing_xact_state = true;
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-					entry->changing_xact_state = false;
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-
-					/*
-					 * Don't try to clean up the connection if we're already
-					 * in error recursion trouble.
-					 */
-					if (in_error_recursion_trouble())
-						entry->changing_xact_state = true;
-
-					/*
-					 * If connection is already unsalvageable, don't touch it
-					 * further.
-					 */
-					if (entry->changing_xact_state)
-						break;
-
-					/*
-					 * Mark this connection as in the process of changing
-					 * transaction state.
-					 */
-					entry->changing_xact_state = true;
-
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE &&
-						!pgfdw_cancel_query(entry->conn))
-					{
-						/* Unable to cancel running query. */
-						abort_cleanup_failure = true;
-					}
-					else if (!pgfdw_exec_cleanup_query(entry->conn,
-													   "ABORT TRANSACTION",
-													   false))
-					{
-						/* Unable to abort remote transaction. */
-						abort_cleanup_failure = true;
-					}
-					else if (entry->have_prep_stmt && entry->have_error &&
-							 !pgfdw_exec_cleanup_query(entry->conn,
-													   "DEALLOCATE ALL",
-													   true))
-					{
-						/* Trouble clearing prepared statements. */
-						abort_cleanup_failure = true;
-					}
-					else
-					{
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-
-					/* Disarm changing_xact_state if it all worked. */
-					entry->changing_xact_state = abort_cleanup_failure;
-					break;
-			}
-		}
-
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
-
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-			entry->changing_xact_state)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			disconnect_pg_server(entry);
-		}
-	}
-
-	/*
-	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
-	 */
-	xact_got_connection = false;
-
-	/* Also reset cursor numbering for next transaction */
-	cursor_number = 0;
-}
-
-/*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
 static void
@@ -1193,3 +1069,255 @@ exit:	;
 		*result = last_res;
 	return timed_out;
 }
+
+/*
+ * Prepare the transaction on the foreign server. This function is called
+ * only at the pre-commit phase of the local transaction. Since the
+ * connection to the server should already be cached, we look it up by umid
+ * (the key of the connection cache) and don't need serverid and userid to
+ * fetch the user mapping.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  const char *prep_id)
+{
+	ConnCacheEntry *entry = NULL;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	pgfdw_reject_incomplete_xact_state_change(entry);
+
+	if (entry->conn)
+	{
+		StringInfo	command;
+		bool		result;
+
+		pgfdw_reject_incomplete_xact_state_change(entry);
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_id);
+
+		entry->changing_xact_state = true;
+		result = pgfdw_exec_cleanup_query(entry->conn, command->data, false);
+		entry->changing_xact_state = false;
+
+		elog(DEBUG1, "prepare foreign transaction on server %u with ID %s",
+			 serverid, prep_id);
+
+		pgfdw_cleanup_after_transaction(entry, true);
+		return result;
+	}
+
+	return false;
+}
+
+/*
+ * Commit or abort the transaction on the foreign server. This function is
+ * called both at the pre-commit phase of the local transaction when
+ * committing and at the end of the local transaction when aborting.
+ * Since we should already have the connections to the servers involved in
+ * the local transaction, we don't use serverid and userid, which would be
+ * needed to get the user mapping that is the key of the connection cache.
+ */
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid,
+							  bool is_commit)
+{
+	ConnCacheEntry *entry = NULL;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	/*
+	 * If abort cleanup previously failed for this connection, we can't issue
+	 * any more commands against it.
+	 */
+	if (is_commit)
+		pgfdw_reject_incomplete_xact_state_change(entry);
+
+	if (entry->conn)
+	{
+		StringInfo	command;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",	is_commit ? "COMMIT" : "ROLLBACK");
+		entry->changing_xact_state = true;
+		result = pgfdw_exec_cleanup_query(entry->conn, command->data, false);
+		entry->changing_xact_state = false;
+
+		pgfdw_cleanup_after_transaction(entry, true);
+		return result;
+	}
+
+	return false;
+}
+
+/*
+ * Commit or abort a prepared transaction on the foreign server. This
+ * function can be called both at the end of the local transaction and in a
+ * new transaction, for example by the resolver process.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit, const char *prep_id)
+{
+	ConnCacheEntry *entry;
+	PGconn			*conn;
+
+	/*
+	 * If we are in a valid transaction state, we are resolving the foreign
+	 * transaction from a newly started transaction and might not have a
+	 * connection to the server yet. So we use GetConnection, which
+	 * establishes the connection if we don't have one. This can happen when
+	 * the foreign transaction resolver process tries to resolve it. On the
+	 * other hand, if we are not in a valid transaction state, we are resolving
+	 * the foreign transaction at the end of the local transaction. Since we
+	 * should already have the connection, we just get the connection cache entry.
+	 */
+	if (IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, false);
+	else
+	{
+		entry = GetConnectionCacheEntry(umid);
+
+		/* Reject further use of connections which failed abort cleanup */
+		if (is_commit)
+			pgfdw_reject_incomplete_xact_state_change(entry);
+
+		conn = entry->conn;
+	}
+
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%s'",
+						 is_commit ? "COMMIT" : "ROLLBACK",
+						 prep_id);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+
+			/*
+			 * The command failed, raise a warning to log the reason of failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+
+			if (diag_sqlstate)
+			{
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
+			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
+		}
+		else
+			result = true;
+
+		elog(DEBUG1, "%s prepared foreign transaction on server %u with ID %s",
+			 is_commit ? "committed" : "aborted", serverid, prep_id);
+
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+
+	return false;
+}
+
+/* Cleanup at main-transaction end */
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry, bool is_commit)
+{
+	if (entry->xact_depth > 0)
+	{
+		if (is_commit)
+		{
+			/*
+			 * If there were any errors in subtransactions, and we made prepared
+			 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+			 * prepared statements. This is annoying and not terribly bulletproof,
+			 * but it's probably not worth trying harder.
+			 *
+			 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+			 * old a server postgres_fdw can communicate with.	We intentionally
+			 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+			 * extent with older servers (leaking prepared statements as we go;
+			 * but we don't really support update operations pre-8.3 anyway).
+			 */
+			if (entry->have_prep_stmt && entry->have_error)
+			{
+				PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+				PQclear(res);
+			}
+
+			entry->have_prep_stmt = false;
+			entry->have_error = false;
+		}
+		else
+		{
+			/*
+			 * Don't try to clean up the connection if we're already in error
+			 * recursion trouble.
+			 */
+			if (in_error_recursion_trouble())
+				entry->changing_xact_state = true;
+
+			/* If connection is already unsalvageable, don't touch it further */
+
+			if (!entry->changing_xact_state)
+			{
+				entry->changing_xact_state = true;
+
+				if (entry->have_prep_stmt &&
+					!pgfdw_exec_cleanup_query(entry->conn, "DEALLOCATE ALL", true))
+				{
+					entry->changing_xact_state = true;
+				}
+			}
+		}
+		/* Reset state to show we're out of a transaction */
+		entry->xact_depth = 0;
+
+		/*
+		 * If the connection isn't in a good idle state, discard it to
+		 * recover. Next GetConnection will open a new connection.
+		 */
+		if (PQstatus(entry->conn) != CONNECTION_OK ||
+			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+		{
+			elog(DEBUG3, "discarding connection %p", entry->conn);
+			PQfinish(entry->conn);
+			entry->conn = NULL;
+		}
+	}
+
+	/*
+	 * Regardless of the event type, we can now mark ourselves as out of the
+	 * transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
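For readers following the connection.c changes above, the overall flow that postgresPrepareForeignTransaction() and postgresResolvePreparedForeignTransaction() plug into can be sketched roughly as below. This is a standalone illustration only, not code from the patch: the Participant struct and the prepare_all(), resolve_all() and two_phase_commit() helpers are hypothetical names, and the remote PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED commands are simulated with plain flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one foreign server taking part in a transaction. */
typedef struct Participant
{
	const char *name;
	bool		can_prepare;	/* simulated outcome of PREPARE TRANSACTION */
	bool		prepared;		/* true between phase 1 and phase 2 */
	bool		committed;		/* final outcome on this server */
} Participant;

/* Phase 1: try to prepare everywhere; stop at the first failure. */
static bool
prepare_all(Participant *servers, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (!servers[i].can_prepare)
			return false;
		servers[i].prepared = true;
	}
	return true;
}

/*
 * Phase 2: resolve every prepared participant the same way.  Servers that
 * never reached the prepared state are skipped here; in a real system they
 * would be rolled back with an ordinary ROLLBACK.
 */
static void
resolve_all(Participant *servers, size_t n, bool commit)
{
	for (size_t i = 0; i < n; i++)
	{
		if (!servers[i].prepared)
			continue;
		servers[i].committed = commit;
		servers[i].prepared = false;
	}
}

/* Returns true if the transaction committed on all participants. */
static bool
two_phase_commit(Participant *servers, size_t n)
{
	bool		all_prepared;

	assert(servers != NULL || n == 0);
	all_prepared = prepare_all(servers, n);
	resolve_all(servers, n, all_prepared);
	return all_prepared;
}
```

The point of the sketch is only the invariant: no participant is committed unless every participant prepared successfully. It does not model the lost-connection case, where a prepared transaction stays behind on the remote server until it is resolved later.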
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641..629f0a2 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,13 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -82,6 +94,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +137,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7_not_twophase (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8_twophase (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9_twophase (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -180,16 +202,19 @@ ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 \det+
-                                      List of foreign tables
- Schema |   Table    |  Server   |                   FDW options                    | Description 
---------+------------+-----------+--------------------------------------------------+-------------
- public | ft1        | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
- public | ft2        | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
- public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
- public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
- public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
- public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+                                         List of foreign tables
+ Schema |      Table       |  Server   |                   FDW options                    | Description 
+--------+------------------+-----------+--------------------------------------------------+-------------
+ public | ft1              | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
+ public | ft2              | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
+ public | ft4              | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
+ public | ft5              | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft6              | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7_not_twophase | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8_twophase     | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9_twophase     | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft_pg_type       | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
+(9 rows)
 
 -- Test that alteration of server options causes reconnection
 -- Remote's errors might be non-English, so hide them to ensure stable results
@@ -7525,3 +7550,301 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
 (4 rows)
 
 RESET enable_partition_wise_join;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+-- Check two_phase_commit setting
+SELECT srvname FROM pg_foreign_server WHERE 'two_phase_commit=on' = ANY(srvoptions) or 'two_phase_commit=off' = ANY(srvoptions);
+  srvname  
+-----------
+ loopback
+ loopback2
+ loopback3
+(3 rows)
+
+-- modify one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+(1 row)
+
+-- modify one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+(1 row)
+
+-- modify two supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(2);
+INSERT INTO ft9_twophase VALUES(2);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- modify two supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(3);
+INSERT INTO ft9_twophase VALUES(3);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- modify local and one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(4);
+INSERT INTO "S 1"."T 6" VALUES (4);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  4
+(1 row)
+
+-- modify local and one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(5);
+INSERT INTO "S 1"."T 6" VALUES (5);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  4
+(1 row)
+
+-- modify supported server and non-supported server and commit.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(6);
+INSERT INTO ft8_twophase VALUES(6);
+COMMIT;
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- modify supported server and non-supported server and rollback.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(7);
+INSERT INTO ft8_twophase VALUES(7);
+ROLLBACK;
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- modify foreign server and raise an error
+BEGIN;
+INSERT INTO ft8_twophase VALUES(8);
+INSERT INTO ft9_twophase VALUES(NULL); -- violation
+ERROR:  null value in column "c1" violates not-null constraint
+DETAIL:  Failing row contains (null).
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- commit and rollback foreign transactions that are part of
+-- prepare transaction.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(9);
+INSERT INTO ft9_twophase VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+BEGIN;
+INSERT INTO ft8_twophase VALUES(10);
+INSERT INTO ft9_twophase VALUES(10);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+-- fails, cannot prepare the transaction if non-supporeted
+-- server involved in.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(11);
+INSERT INTO ft8_twophase VALUES(11);
+PREPARE TRANSACTION 'gx1';
+ERROR:  can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 67e1c59..67e1127 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..4c2562a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdwxact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -348,6 +350,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+extern char *postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len);
 
 /*
  * Helper functions
@@ -420,7 +423,6 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
 				  const PgFdwRelationInfo *fpinfo_o,
 				  const PgFdwRelationInfo *fpinfo_i);
 
-
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
  * to my callback routines.
@@ -469,6 +471,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -476,6 +484,38 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 }
 
 /*
+ * postgresGetPrepareId
+ *
+ * Craft a prepared transaction identifier. The PostgreSQL documentation
+ * mentions two restrictions on the identifier:
+ * 1. It must be a string literal, less than 200 bytes long.
+ * 2. It must not be the same as the identifier of any other currently
+ *    prepared transaction.
+ *
+ * Ideally we would use something like a UUID, which gives unique ids with
+ * high probability, but that may be expensive here, and the UUID extension
+ * that provides the generator function is not part of core.
+ */
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
+{
+	/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char	   *prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4ld_%u_%u",
+			 "px", Abs(random() * RANDOM_LARGE_MULTIPLIER),
+			 serverid, userid);
+
+	/* Report the length, not counting the terminating NUL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
+
+/*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
@@ -1322,7 +1362,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1671,6 +1711,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	Oid			userid;
 	ForeignTable *table;
 	UserMapping *user;
+	ForeignServer *server;
 	AttrNumber	n_params;
 	Oid			typefnoid;
 	bool		isvarlena;
@@ -1698,9 +1739,15 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	user = GetUserMapping(userid, table->serverid);
+	server = GetForeignServer(user->serverid);
+
+	/* Remember this foreign server has been modified */
+	FdwXactRegisterForeignServer(user->serverid, user->userid,
+								 server_uses_two_phase_commit(server),
+								 true);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2272,6 +2319,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	RangeTblEntry *rte;
 	Oid			userid;
 	ForeignTable *table;
+	ForeignServer *server;
 	UserMapping *user;
 	int			numParams;
 
@@ -2298,12 +2346,18 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	dmstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(dmstate->rel));
 	user = GetUserMapping(userid, table->serverid);
+	server = GetForeignServer(user->serverid);
+
+	/* Remember this foreign server has been modified */
+	FdwXactRegisterForeignServer(user->serverid, user->userid,
+								 server_uses_two_phase_commit(server),
+								 true);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2564,7 +2618,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								&retrieved_attrs, NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3501,7 +3555,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -3591,7 +3645,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -3814,7 +3868,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
@@ -5173,3 +5227,26 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
 	/* We didn't find any suitable equivalence class expression */
 	return NULL;
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..856ddf5 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdwxact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -115,7 +116,9 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
+extern PGconn *GetExistingConnection(Oid umid);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -123,6 +126,14 @@ extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
 extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, const char *prep_id);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid,
+										  Oid umid, bool is_commit);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  const char *prep_id);
+
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
@@ -177,6 +188,7 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
 extern const char *get_jointype_name(JoinType jointype);
+extern bool server_uses_two_phase_commit(ForeignServer *server);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c7..c27c0e6 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -88,6 +101,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +150,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7_not_twophase (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8_twophase (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9_twophase (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1835,3 +1862,109 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
 
 RESET enable_partition_wise_join;
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+-- Check two_phase_commit setting
+SELECT srvname FROM pg_foreign_server WHERE 'two_phase_commit=on' = ANY(srvoptions) or 'two_phase_commit=off' = ANY(srvoptions);
+
+-- modify one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+COMMIT;
+SELECT * FROM ft8_twophase;
+
+-- modify one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+
+-- modify two supported servers and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(2);
+INSERT INTO ft9_twophase VALUES(2);
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- modify two supported servers and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(3);
+INSERT INTO ft9_twophase VALUES(3);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- modify local and one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(4);
+INSERT INTO "S 1"."T 6" VALUES (4);
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM "S 1"."T 6";
+
+-- modify local and one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(5);
+INSERT INTO "S 1"."T 6" VALUES (5);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+SELECT * FROM "S 1"."T 6";
+
+-- modify supported server and non-supported server and commit.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(6);
+INSERT INTO ft8_twophase VALUES(6);
+COMMIT;
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
+
+-- modify supported server and non-supported server and rollback.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(7);
+INSERT INTO ft8_twophase VALUES(7);
+ROLLBACK;
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
+
+-- modify foreign server and raise an error
+BEGIN;
+INSERT INTO ft8_twophase VALUES(8);
+INSERT INTO ft9_twophase VALUES(NULL); -- violation
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- commit and rollback foreign transactions that are part of
+-- a prepared transaction.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(9);
+INSERT INTO ft9_twophase VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+BEGIN;
+INSERT INTO ft8_twophase VALUES(10);
+INSERT INTO ft9_twophase VALUES(10);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- fails, cannot prepare the transaction if a non-supported
+-- server is involved.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(11);
+INSERT INTO ft8_twophase VALUES(11);
+PREPARE TRANSACTION 'gx1';
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 54b5e98..f065b7b 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -436,6 +436,43 @@
    </para>
 
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted
+    independently. Some of these transactions may fail to commit while
+    others commit successfully. This behavior may be overridden using
+    the following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> uses
+       two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"/> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions"/>
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </sect3>
  </sect2>
 
  <sect2>
-- 
1.7.1

#171

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

about 8 years ago

In reply to: Masahiko Sawada (#170)

RE: [HACKERS] Transactions involving multiple postgres foreign servers

From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]

I've updated documentation of patches, and fixed some bugs. I did some
failure tests of this feature using a fault simulation tool[1] for
PostgreSQL that I created.

0001 patch adds a mechanism to keep track of writes on the local server. This is
required to determine whether we should use 2pc at commit. 0002 patch is
the main part. It adds a distributed transaction manager (currently only
for atomic commit), APIs for 2pc and foreign transaction manager resolver
process. 0003 patch makes postgres_fdw support atomic commit using 2pc.

Please review patches.

I'd like to join the review and testing of this functionality. First, some comments after taking a quick look at 0001:

(1)
Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2pc is unnecessary.

(2)
If TransactionDidWrite is necessary, I don't think you need to provide setter functions, because other transaction state variables are accessed globally without getter/setters. And you didn't create getter function for TransactionDidWrite.

(3)
heap_multi_insert() doesn't modify TransactionDidWrite. Is it sufficient to just remember heap modifications? Are other modifications on the coordinator node covered, such as TRUNCATE and REINDEX?

Questions before looking at 0002 and 0003:

Q1: Does this functionality work when combined with XA 2pc transactions?

Q2: Does the atomic commit cascade across multiple remote databases? For example:
* The local transaction modifies data on remote database 1 via a foreign table.
* A trigger fires on the remote database 1, which modifies data on remote database 2 via a foreign table.
* The local transaction commits.

Regards
Takayuki Tsunakawa

#172

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Tsunakawa, Takayuki (#171)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Thu, Dec 28, 2017 at 11:40 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]

I've updated documentation of patches, and fixed some bugs. I did some
failure tests of this feature using a fault simulation tool[1] for
PostgreSQL that I created.

0001 patch adds a mechanism to keep track of writes on the local server. This is
required to determine whether we should use 2pc at commit. 0002 patch is
the main part. It adds a distributed transaction manager (currently only
for atomic commit), APIs for 2pc and foreign transaction manager resolver
process. 0003 patch makes postgres_fdw support atomic commit using 2pc.

Please review patches.

I'd like to join the review and testing of this functionality. First, some comments after taking a quick look at 0001:

Thank you so much!

(1)
Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2pc is unnecessary.

Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if
we did any writes on local node which requires the atomic commit. Will
fix.

(2)
If TransactionDidWrite is necessary, I don't think you need to provide setter functions, because other transaction state variables are accessed globally without getter/setters. And you didn't create getter function for TransactionDidWrite.

(3)
heap_multi_insert() doesn't modify TransactionDidWrite. Is it sufficient to just remember heap modifications? Are other modifications on the coordinator node covered, such as TRUNCATE and REINDEX?

I think using (XactLastRecEnd != 0 && markXidCommitted) to check
whether we did any writes on the local node would be better. After that
change I will be able to deal with all of the above concerns.

Questions before looking at 0002 and 0003:

Q1: Does this functionality work when combined with XA 2pc transactions?

All transactions, including the local transaction and the foreign
transactions, are prepared at PREPARE. All of them are then committed
or rolled back by the foreign transaction resolver process at
COMMIT/ROLLBACK PREPARED.

Q2: Does the atomic commit cascade across multiple remote databases? For example:
* The local transaction modifies data on remote database 1 via a foreign table.
* A trigger fires on the remote database 1, which modifies data on remote database 2 via a foreign table.
* The local transaction commits.

I haven't yet tested more complex failure situations, but as far as I
have tested in my environment, the cascading atomic commit works. I'll
test these cases more deeply.

Regards,

Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#173

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

about 8 years ago

In reply to: Masahiko Sawada (#172)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Thu, Dec 28, 2017 at 11:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

(1)
Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2pc is unnecessary.

Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if
we did any writes on local node which requires the atomic commit. Will
fix.

I haven't checked how much code it needs to track whether the local
transaction wrote anything. But probably we can postpone this
optimization. If it's easy to incorporate, it's good to have in the
first set itself.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#174

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Ashutosh Bapat (#173)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Mon, Jan 1, 2018 at 7:12 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Thu, Dec 28, 2017 at 11:08 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

(1)
Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2pc is unnecessary.

Perhaps we can use (XactLastRecEnd != 0 && markXidCommitted) to see if
we did any writes on local node which requires the atomic commit. Will
fix.

I haven't checked how much code it needs to track whether the local
transaction wrote anything. But probably we can postpone this
optimization. If it's easy to incorporate, it's good to have in the
first set itself.

Without tracking local writes, we always have to use two-phase
commit, even when the transaction writes data on only one foreign
server. That would cause unnecessary performance degradation and
transaction failures on existing systems. We can enable or disable
two-phase commit per foreign server with ALTER SERVER, but that affects
other transactions too. If we could configure it per transaction,
postponing this might be worthwhile.
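
The trade-off described above can be sketched as a small standalone C function. This is only an illustration under stated assumptions, not code from the patch set: the name demo_need_two_phase_commit and the rule "two or more modified participants, counting the local node if it wrote" are hypothetical simplifications of the behavior discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper (not part of the patch): decide whether commit has
 * to go through two-phase commit.
 *
 * Assumption for this sketch: 2PC is needed only when at least two
 * participants were modified, where the local node counts as one
 * participant if it wrote to a non-temporary relation.
 */
static bool
demo_need_two_phase_commit(int modified_foreign_servers, bool local_wrote)
{
	int		participants;

	/* A negative server count would be a caller bug. */
	assert(modified_foreign_servers >= 0);

	participants = modified_foreign_servers + (local_wrote ? 1 : 0);
	return participants >= 2;
}
```

Under this model, a coordinator that does not track local writes has to pass local_wrote = true unconditionally, so a transaction touching a single foreign server would always be treated as needing 2PC.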

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#175

Robert Haas

robertmhaas@gmail.com

about 8 years ago

In reply to: Tsunakawa, Takayuki (#171)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Wed, Dec 27, 2017 at 9:40 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

(1)
Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2pc is unnecessary.

If I understand correctly, XactLastRecEnd can be set by, for example,
a HOT cleanup record, so that doesn't seem like a good thing to use.
Whether we need to use 2PC across remote nodes seems like it shouldn't
depend on whether a local SELECT statement happened to do a HOT
cleanup or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#176

Masahiko Sawada

sawada.mshk@gmail.com

about 8 years ago

In reply to: Robert Haas (#175)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Tue, Jan 9, 2018 at 11:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Dec 27, 2017 at 9:40 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

(1)
Why don't you use the existing global variable MyXactFlags instead of the new TransactionDidWrite? Or, how about using XactLastRecEnd != 0 to determine the transaction did any writes? When the transaction only modified temporary tables on the local database and some data on one remote database, I think 2pc is unnecessary.

If I understand correctly, XactLastRecEnd can be set by, for example,
a HOT cleanup record, so that doesn't seem like a good thing to use.

Yes, that's right.

Whether we need to use 2PC across remote nodes seems like it shouldn't
depend on whether a local SELECT statement happened to do a HOT
cleanup or not.

So I think we need to check if the top transaction is invalid or not as well.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#177

Robert Haas

robertmhaas@gmail.com

almost 8 years ago

In reply to: Masahiko Sawada (#176)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Tue, Jan 9, 2018 at 9:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

If I understand correctly, XactLastRecEnd can be set by, for example,
a HOT cleanup record, so that doesn't seem like a good thing to use.

Yes, that's right.

Whether we need to use 2PC across remote nodes seems like it shouldn't
depend on whether a local SELECT statement happened to do a HOT
cleanup or not.

So I think we need to check if the top transaction is invalid or not as well.

Even if you check both, it doesn't sound like it really does what you
want. Won't you still end up partially dependent on whether a HOT
cleanup happened, if not in quite the same way as before? How about
defining a new bit in MyXactFlags for XACT_FLAGS_WROTENONTEMPREL?
Just have heap_insert, heap_update, and heap_delete do something like:

if (RelationNeedsWAL(relation))
    MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;

Overall, what's the status of this patch? Are we hung up on this
issue only, or are there other things?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#178

Masahiko Sawada

sawada.mshk@gmail.com

almost 8 years ago

In reply to: Robert Haas (#177)

3 attachment(s)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Thu, Feb 8, 2018 at 3:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 9, 2018 at 9:49 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

If I understand correctly, XactLastRecEnd can be set by, for example,
a HOT cleanup record, so that doesn't seem like a good thing to use.

Yes, that's right.

Whether we need to use 2PC across remote nodes seems like it shouldn't
depend on whether a local SELECT statement happened to do a HOT
cleanup or not.

So I think we need to check if the top transaction is invalid or not as well.

Even if you check both, it doesn't sound like it really does what you
want. Won't you still end up partially dependent on whether a HOT
cleanup happened, if not in quite the same way as before? How about
defining a new bit in MyXactFlags for XACT_FLAGS_WROTENONTEMPREL?
Just have heap_insert, heap_update, and heap_delete do something like:

if (RelationNeedsWAL(relation))
    MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;

Agreed.

Overall, what's the status of this patch? Are we hung up on this
issue only, or are there other things?

AFAIK there are no other technical issues in this patch so far apart from
this one. The patch has tests and docs, and includes everything needed to
support atomic commit for distributed transactions: it introduces the
atomic commit ability for distributed transactions along with the
corresponding FDW APIs, and makes postgres_fdw support 2pc. I think
this patch needs to be reviewed, especially the foreign transaction
resolution functionality, which has been redesigned.

The previous patches don't apply cleanly to current HEAD, and I've
fixed some issues. Attached is the latest patch set.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

0001-Keep-track-of-writing-on-non-temporary-relation_v15.patch (application/octet-stream)

From a1a63e5cc27617ef6f4ad9de94e1ddf0a746629e Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Thu, 8 Feb 2018 11:26:46 +0900
Subject: [PATCH 1/3] Keep track of writing on non-temporary relation.

---
 src/backend/access/heap/heapam.c |   12 ++++++++++++
 src/include/access/xact.h        |    5 +++++
 2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8a846e7..6142af4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2606,6 +2606,10 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 		heap_freetuple(heaptup);
 	}
 
+	/* Make note that we've written to a non-temporary relation */
+	if (RelationNeedsWAL(relation))
+		MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
 	return HeapTupleGetOid(tup);
 }
 
@@ -3425,6 +3429,10 @@ l1:
 	if (old_key_tuple != NULL && old_key_copied)
 		heap_freetuple(old_key_tuple);
 
+	/* Make note that we've written to a non-temporary relation */
+	if (RelationNeedsWAL(relation))
+		MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
 	return HeapTupleMayBeUpdated;
 }
 
@@ -4366,6 +4374,10 @@ l2:
 	if (old_key_tuple != NULL && old_key_copied)
 		heap_freetuple(old_key_tuple);
 
+	/* Make note that we've written to a non-temporary relation */
+	if (RelationNeedsWAL(relation))
+		MyXactFlags |= XACT_FLAGS_WROTENONTEMPREL;
+
 	bms_free(hot_attrs);
 	bms_free(key_attrs);
 	bms_free(id_attrs);
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 6445bbc..9be79e2 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -91,6 +91,11 @@ extern int	MyXactFlags;
  */
 #define XACT_FLAGS_ACQUIREDACCESSEXCLUSIVELOCK	(1U << 1)
 
+/*
+ * XACT_FLAGS_WROTENONTEMPREL - set when we wrote data to a non-temporary
+ * relation.
+ */
+#define XACT_FLAGS_WROTENONTEMPREL				(1U << 2)
 
 /*
  *	start- and end-of-transaction callbacks for dynamically loaded modules
-- 
1.7.1

0002-Support-atomic-commit-involving-multiple-foreign-ser_v15.patch (application/octet-stream)

From 18eb819c3a5af3a502d8f0a7859d4fae92ab264d Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Sat, 21 Oct 2017 16:05:35 +0900
Subject: [PATCH 2/3] Support atomic commit involving multiple foreign servers.

---
 doc/src/sgml/catalogs.sgml                    |   97 +
 doc/src/sgml/config.sgml                      |   89 +
 doc/src/sgml/fdwhandler.sgml                  |  142 ++
 doc/src/sgml/func.sgml                        |   51 +
 doc/src/sgml/monitoring.sgml                  |   44 +
 src/backend/access/rmgrdesc/Makefile          |    8 +-
 src/backend/access/rmgrdesc/fdwxactdesc.c     |   66 +
 src/backend/access/rmgrdesc/xlogdesc.c        |    6 +-
 src/backend/access/transam/Makefile           |    6 +-
 src/backend/access/transam/fdwxact.c          | 2513 +++++++++++++++++++++++++
 src/backend/access/transam/fdwxact_resolver.c |  518 +++++
 src/backend/access/transam/rmgr.c             |    1 +
 src/backend/access/transam/twophase.c         |   42 +
 src/backend/access/transam/xact.c             |   26 +-
 src/backend/access/transam/xlog.c             |   19 +-
 src/backend/catalog/system_views.sql          |   11 +
 src/backend/commands/foreigncmds.c            |   20 +
 src/backend/postmaster/bgworker.c             |    4 +
 src/backend/postmaster/pgstat.c               |    6 +
 src/backend/postmaster/postmaster.c           |    5 +
 src/backend/replication/logical/decode.c      |    1 +
 src/backend/storage/ipc/ipci.c                |    6 +
 src/backend/storage/lmgr/lwlocknames.txt      |    2 +
 src/backend/storage/lmgr/proc.c               |    8 +
 src/backend/utils/misc/guc.c                  |   46 +
 src/backend/utils/misc/postgresql.conf.sample |    2 +
 src/backend/utils/probes.d                    |    2 +
 src/bin/initdb/initdb.c                       |    1 +
 src/bin/pg_controldata/pg_controldata.c       |    2 +
 src/bin/pg_resetwal/pg_resetwal.c             |    2 +
 src/bin/pg_waldump/rmgrdesc.c                 |    1 +
 src/include/access/fdwxact.h                  |  134 ++
 src/include/access/fdwxact_resolver.h         |   27 +
 src/include/access/fdwxact_xlog.h             |   51 +
 src/include/access/resolver_private.h         |   61 +
 src/include/access/rmgrlist.h                 |    1 +
 src/include/access/twophase.h                 |    1 +
 src/include/access/xlog_internal.h            |    1 +
 src/include/catalog/pg_control.h              |    1 +
 src/include/catalog/pg_proc.h                 |   11 +
 src/include/foreign/fdwapi.h                  |   18 +
 src/include/pgstat.h                          |    4 +-
 src/include/storage/proc.h                    |   10 +
 src/test/recovery/Makefile                    |    2 +-
 src/test/recovery/t/015_fdwxact.pl            |  176 ++
 src/test/regress/expected/rules.out           |   12 +
 src/test/regress/pg_regress.c                 |   13 +-
 47 files changed, 4251 insertions(+), 19 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/fdwxactdesc.c
 create mode 100755 src/backend/access/transam/fdwxact.c
 create mode 100644 src/backend/access/transam/fdwxact_resolver.c
 create mode 100644 src/include/access/fdwxact.h
 create mode 100644 src/include/access/fdwxact_resolver.h
 create mode 100644 src/include/access/fdwxact_xlog.h
 create mode 100644 src/include/access/resolver_private.h
 create mode 100644 src/test/recovery/t/015_fdwxact.pl

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 71e20f2..ffb7079 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -9558,6 +9558,103 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
 
  </sect1>
 
+ <sect1 id="view-pg-prepared-fdw-xacts">
+  <title><structname>pg_prepared_fdw_xacts</structname></title>
+
+  <indexterm zone="view-pg-prepared-fdw-xacts">
+   <primary>pg_prepared_fdw_xacts</primary>
+  </indexterm>
+
+  <para>
+   The view <structname>pg_prepared_fdw_xacts</structname> displays
+   information about foreign transactions that are currently prepared on
+   foreign servers for atomic distributed transaction commit (see
+   <xref linkend="fdw-transactions"/> for details).
+  </para>
+
+  <para>
+   <structname>pg_prepared_fdw_xacts</structname> contains one row per prepared
+   foreign transaction.  An entry is removed when the foreign transaction is
+   committed or rolled back.
+  </para>
+
+  <table>
+   <title><structname>pg_prepared_fdw_xacts</structname> Columns</title>
+
+   <tgroup cols="4">
+    <thead>
+     <row>
+      <entry>Name</entry>
+      <entry>Type</entry>
+      <entry>References</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry><structfield>dbid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-database"><structname>pg_database</structname></link>.oid</literal></entry>
+      <entry>
+       OID of the database which the foreign transaction resides in
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>transaction</structfield></entry>
+      <entry><type>xid</type></entry>
+      <entry></entry>
+      <entry>
+       Transaction ID that this foreign transaction is associated with
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>serverid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="catalog-pg-foreign-server"><structname>pg_foreign_server</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the foreign server on which this foreign transaction is prepared
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>userid</structfield></entry>
+      <entry><type>oid</type></entry>
+      <entry><literal><link linkend="view-pg-user"><structname>pg_user</structname></link>.oid</literal></entry>
+      <entry>
+       The OID of the user that prepared this foreign transaction.
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>status</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       Status of foreign transaction: <literal>prepared</literal>, <literal>committing</literal>, <literal>aborting</literal> or <literal>unknown</literal>
+      </entry>
+     </row>
+     <row>
+      <entry><structfield>identifier</structfield></entry>
+      <entry><type>text</type></entry>
+      <entry></entry>
+      <entry>
+       The identifier of the prepared foreign transaction.
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   When the <structname>pg_prepared_fdw_xacts</structname> view is accessed, the
+   internal transaction manager data structures are momentarily locked, and
+   a copy is made for the view to display.  This ensures that the
+   view produces a consistent set of results, while not blocking
+   normal operations longer than necessary.  Nonetheless
+   there could be some impact on database performance if this view is
+   frequently accessed.
+  </para>
+
+ </sect1>
+
  <sect1 id="view-pg-publication-tables">
   <title><structname>pg_publication_tables</structname></title>
 
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index c45979d..9516ff1 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1485,6 +1485,25 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-max-prepared-foreign-transactions" xreflabel="max_prepared_foreign_transactions">
+      <term><varname>max_prepared_foreign_transactions</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_prepared_foreign_transactions</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the maximum number of foreign transactions that can be prepared
+        simultaneously. This parameter can only be set at server start.
+       </para>
+       <para>
+        When running a standby server, you must set this parameter to the
+        same or higher value than on the master server. Otherwise, queries
+        will not be allowed in the standby server.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-work-mem" xreflabel="work_mem">
       <term><varname>work_mem</varname> (<type>integer</type>)
       <indexterm>
@@ -3541,6 +3560,76 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
      </variablelist>
     </sect2>
 
+    <sect2 id="runtime-config-foreign-transaction-resolver">
+     <title>Foreign Transaction Resolvers</title>
+
+     <para>
+      These settings control the behavior of a foreign transaction resolver.
+     </para>
+
+     <variablelist>
+
+     <varlistentry id="guc-max-foreign-transaction-resolvers" xreflabel="max_foreign_transaction_resolvers">
+      <term><varname>max_foreign_transaction_resolvers</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>max_foreign_transaction_resolvers</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies the maximum number of foreign transaction resolution workers. A foreign
+        transaction resolver is responsible for resolving foreign transactions on one database.
+       </para>
+       <para>
+        Foreign transaction resolution workers are taken from the pool defined by
+        <varname>max_worker_processes</varname>.
+       </para>
+       <para>
+        The default value is 0.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="foreign-transaction-resolution-interval" xreflabel="foreign_transaction_resolution_interval">
+      <term><varname>foreign_transaction_resolution_interval</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>foreign_transaction_resolution_interval</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies how long the foreign transaction resolver waits before trying to resolve
+        foreign transactions. This parameter can only be set in the
+        <filename>postgresql.conf</filename> file or on the server command line.
+       </para>
+       <para>
+        The default value is 10 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="foreign-transaction-resolver-timeout" xreflabel="foreign_transaction_resolver_timeout">
+      <term><varname>foreign_transaction_resolver_timeout</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>foreign_transaction_resolver_timeout</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Terminates foreign transaction resolver processes that have had no foreign
+        transactions to resolve for longer than the specified number of milliseconds.
+        A value of zero disables the timeout mechanism.  This parameter can only be set in
+        the <filename>postgresql.conf</filename> file or on the server command line.
+       </para>
+       <para>
+        The default value is 60 seconds.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     </variablelist>
+    </sect2>
+
    </sect1>
 
    <sect1 id="runtime-config-query">
diff --git a/doc/src/sgml/fdwhandler.sgml b/doc/src/sgml/fdwhandler.sgml
index 0ed3a47..4294ccc 100644
--- a/doc/src/sgml/fdwhandler.sgml
+++ b/doc/src/sgml/fdwhandler.sgml
@@ -1315,6 +1315,61 @@ ReparameterizeForeignPathByChild(PlannerInfo *root, List *fdw_private,
     </para>
    </sect2>
 
+   <sect2 id="fdw-callbacks-transaction-managements">
+    <title>FDW Routines for Transaction Management</title>
+
+    <para>
+     If an FDW wishes to support <firstterm>atomic distributed transaction commit</firstterm>
+     (as described in <xref linkend="fdw-transactions"/>), it must provide
+     the following callback functions:
+    </para>
+
+    <para>
+<programlisting>
+char *
+GetPrepareId(Oid serverid, Oid userid, int *prep_info_len);
+</programlisting>
+    Generate a prepared transaction identifier. Note that the transaction
+    identifier must be a string shorter than 200 bytes and
+    should not be the same as any other concurrent prepared transaction
+    identifier.
+    </para>
+    <para>
+<programlisting>
+bool
+PrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+                          const char *prep_id);
+</programlisting>
+    Prepare the foreign transaction on the foreign server. This function is
+    called during the pre-commit phase of the local transaction, before the
+    local transaction commits. A return value of true means that the foreign
+    transaction was prepared successfully.
+    </para>
+    <para>
+<programlisting>
+bool
+EndForeignTransaction(Oid serverid, Oid userid, Oid umid,
+                      const char *prep_id);
+</programlisting>
+    Commit or roll back the foreign transaction on the foreign server. For
+    foreign servers that support the two-phase commit protocol, this function
+    is called both at the pre-commit phase of the local transaction when committing
+    and at the end of the local transaction when aborting. For foreign servers
+    that don't support the two-phase commit protocol, this function is called
+    at the pre-commit phase of the local transaction.
+    </para>
+    <para>
+<programlisting>
+bool
+ResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+                                  Oid umid, bool is_commit,
+                                  const char *prep_id);
+</programlisting>
+    Commit or roll back the prepared foreign transaction on the foreign server.
+    This function is called both by the foreign transaction resolver process, after
+    the foreign transaction has been prepared, and by <function>pg_resolve_fdw_xacts</function>.
+    </para>
+   </sect2>
    </sect1>
 
    <sect1 id="fdw-helpers">
@@ -1760,4 +1815,91 @@ GetForeignServerByName(const char *name, bool missing_ok);
 
   </sect1>
 
+  <sect1 id="fdw-transactions">
+    <title>Transaction Manager for Foreign Data Wrappers</title>
+
+    <para>
+    The <productname>PostgreSQL</productname> transaction manager allows FDWs to read
+    and write data on foreign servers within a transaction while maintaining atomicity
+    of the foreign data. When an FDW starts a transaction on a foreign server as part
+    of a <productname>PostgreSQL</productname> transaction, it must register the foreign
+    server, along with the <productname>PostgreSQL</productname> user whose user mapping
+    is used to connect to that server, using
+    <function>FdwXactRegisterForeignServer</function>.
+<programlisting>
+void
+FdwXactRegisterForeignServer(Oid serverid,
+                             Oid userid,
+                             bool can_prepare,
+                             bool modify);
+</programlisting>
+    <varname>can_prepare</varname> should be true if the foreign server supports the
+    two-phase commit protocol, and false otherwise. <varname>modify</varname> should be
+    true if the current transaction modifies data on the foreign server.
+    </para>
+
+    <para>
+    An example of such a transaction is as follows:
+<programlisting>
+BEGIN;
+UPDATE ft1 SET col = 'a';
+UPDATE ft2 SET col = 'b';
+COMMIT;
+</programlisting>
+    ft1 and ft2 are foreign tables on different foreign servers, which may be using
+    different Foreign Data Wrappers.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</varname> is greater than zero,
+    <productname>PostgreSQL</productname> employs the two-phase commit protocol to
+    achieve atomic distributed transaction commit. All the registered foreign servers
+    should support the two-phase commit protocol. The two-phase commit protocol is
+    used for achieving atomic distributed transaction commit when two or more foreign
+    servers that support the two-phase commit protocol are involved in the transaction,
+    or when the transaction involves one foreign server that supports the two-phase
+    commit protocol and also changes local data. In other cases, for example when only
+    one foreign server that supports the two-phase commit protocol is involved in the
+    transaction, the two-phase commit protocol is not used. In the two-phase commit protocol
+    the commit is processed in two phases: the prepare phase and the commit phase.
+    In the prepare phase, <productname>PostgreSQL</productname> prepares the transactions
+    on all the registered foreign servers using
+    <function>FdwXactPrepareForeignTransactions</function>. If any of the foreign
+    servers fails to prepare its transaction, the prepare phase fails. In the commit phase,
+    all the prepared transactions are committed by the foreign transaction resolver
+    process if the prepare phase has succeeded, or rolled back if the prepare phase failed
+    to prepare transactions on all the foreign servers.
+    </para>
+
+    <para>
+    During the prepare phase, the distributed transaction manager calls
+    <function>GetPrepareId</function> to get the prepared transaction
+    identifier for each foreign server involved. It stores this identifier along
+    with the server OID, user OID and user mapping OID for later use. It then calls
+    <function>PrepareForeignTransaction</function> with the same identifier.
+    </para>
+
+    <para>
+    During the commit phase, the distributed transaction manager calls
+    <function>ResolvePreparedForeignTransaction</function> with the same identifier and
+    the action FDW_XACT_COMMITTING_PREPARED to commit the prepared transaction or
+    FDW_XACT_ABORTING_PREPARED to roll back the prepared transaction. If the
+    distributed transaction manager fails to commit or roll back a prepared
+    transaction because of a connection failure, the operation can be retried
+    through the built-in function <function>pg_resolve_fdw_xacts</function>, or by the
+    foreign transaction resolver process if it is running.
+    </para>
+
+    <para>
+    When <varname>max_prepared_foreign_transactions</varname> is zero, atomic
+    commit cannot be guaranteed across foreign servers. If the transaction on
+    <productname>PostgreSQL</productname> is committed, the distributed transaction
+    manager commits the transaction on all the foreign servers registered using
+    <function>FdwXactRegisterForeignServer</function>, independent of the outcome
+    of the same operation on other foreign servers. Thus transactions on some
+    foreign servers may be committed, while those on other foreign servers
+    are rolled back. If the transaction on <productname>PostgreSQL</productname>
+    aborts, the transactions on all the foreign servers are aborted too.
+    </para>
+  </sect1>
  </chapter>
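
The prepare and commit phases described in the fdwhandler.sgml hunk above can be sketched as a small standalone C program. This is only an illustration under stated assumptions, not the patch's implementation: the Participant struct, the "fx_%u_%u_%u" identifier format and the function names are all hypothetical, and the real code goes through the FDW callbacks (GetPrepareId, PrepareForeignTransaction, ResolvePreparedForeignTransaction) and the resolver process instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define PREP_ID_MAX 200			/* identifiers must stay under 200 bytes */

typedef enum { FX_ACTIVE, FX_PREPARED, FX_COMMITTED, FX_ABORTED } FxState;

/* Hypothetical stand-in for one registered foreign server connection. */
typedef struct
{
	unsigned	serverid;
	unsigned	userid;
	bool		fail_prepare;	/* simulate a failing PREPARE */
	FxState		state;
	char		prep_id[PREP_ID_MAX];
} Participant;

/* Build an identifier from (xid, serverid, userid); the format is made up. */
static void
make_prep_id(char *buf, size_t len, unsigned xid, const Participant *p)
{
	int			n = snprintf(buf, len, "fx_%u_%u_%u", xid, p->serverid, p->userid);

	assert(n > 0 && (size_t) n < len);	/* must fit, including the NUL */
	(void) n;
}

/* Phase 1 step: returns true when the participant reports success. */
static bool
prepare_participant(Participant *p)
{
	if (p->fail_prepare)
		return false;
	p->state = FX_PREPARED;
	return true;
}

/*
 * Commit a transaction across n participants.  Returns true if every
 * participant committed, false if the transaction was rolled back.
 */
bool
commit_distributed(unsigned xid, Participant *parts, int n, bool local_writes)
{
	/* 2PC is needed for two or more servers, or one server plus local writes */
	bool		need_2pc = (n >= 2) || (n == 1 && local_writes);
	bool		ok = true;
	int			i;

	if (!need_2pc)
	{
		/* One-phase commit: no prepared identifier is generated. */
		for (i = 0; i < n; i++)
			parts[i].state = FX_COMMITTED;
		return true;
	}

	/* Phase 1: prepare on every participant, stopping at the first failure. */
	for (i = 0; i < n && ok; i++)
	{
		make_prep_id(parts[i].prep_id, sizeof(parts[i].prep_id), xid, &parts[i]);
		ok = prepare_participant(&parts[i]);
	}

	/* Phase 2: commit everything, or roll back/abort everything. */
	for (i = 0; i < n; i++)
		parts[i].state = ok ? FX_COMMITTED : FX_ABORTED;

	return ok;
}
```

Calling commit_distributed() with two participants, one of which has fail_prepare set, leaves both in FX_ABORTED. That is the all-or-nothing outcome the two-phase protocol is meant to provide. The sketch ignores connection loss between the phases, which is the case the resolver process and pg_resolve_fdw_xacts exist for.
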
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 640ff09..76f3c12 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -20370,6 +20370,57 @@ SELECT (pg_stat_file('filename')).modification;
 
   </sect2>
 
+  <sect2 id="functions-fdw-transaction">
+   <title>Foreign Transaction Management Functions</title>
+
+   <indexterm>
+    <primary>pg_resolve_fdw_xacts</primary>
+   </indexterm>
+   <indexterm>
+    <primary>pg_remove_fdw_xacts</primary>
+   </indexterm>
+
+   <para>
+    <xref linkend="functions-fdw-transaction-table"/> shows the functions
+    available for foreign transaction management.
+    These functions cannot be executed during recovery. Use of these functions
+    is restricted to superusers.
+   </para>
+
+   <table id="functions-fdw-transaction-table">
+    <title>Foreign Transaction Management Functions</title>
+    <tgroup cols="3">
+     <thead>
+      <row><entry>Name</entry> <entry>Return Type</entry> <entry>Description</entry></row>
+     </thead>
+
+     <tbody>
+      <row>
+       <entry>
+        <literal><function>pg_resolve_fdw_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>bool</type></entry>
+       <entry>
+        Resolve a foreign transaction. This function searches for foreign transactions
+        matching the criteria and resolves them. It does not resolve
+        an entry whose transaction is still in progress, or that is locked by some
+        other backend.
+       </entry>
+      </row>
+      <row>
+       <entry>
+        <literal><function>pg_remove_fdw_xact(<parameter>transaction</parameter> <type>xid</type>, <parameter>serverid</parameter> <type>oid</type>, <parameter>userid</parameter> <type>oid</type>)</function></literal>
+       </entry>
+       <entry><type>void</type></entry>
+       <entry>
+        This function works the same as <function>pg_resolve_fdw_xact</function>
+        except that it removes the foreign transaction entry without resolving it.
+       </entry>
+      </row>
+     </tbody>
+    </tgroup>
+   </table>
+  </sect2>
   </sect1>
 
   <sect1 id="functions-trigger">
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e138d1e..6d28eb6 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -332,6 +332,14 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_fdw_xact_resolver</structname><indexterm><primary>pg_stat_fdw_xact_resolver</primary></indexterm></entry>
+      <entry>One row per foreign transaction resolver process, showing statistics about
+       foreign transaction resolution. See <xref linkend="pg-stat-foreign-xact-resolver-view"/> for
+       details.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -2186,6 +2194,42 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    connection.
   </para>
 
+  <table id="pg-stat-foreign-xact-resolver-view" xreflabel="pg_stat_fdw_xact_resolver">
+   <title><structname>pg_stat_fdw_xact_resolver</structname> View</title>
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>pid</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Process ID of a foreign transaction resolver process</entry>
+    </row>
+    <row>
+     <entry><structfield>dbid</structfield></entry>
+     <entry><type>oid</type></entry>
+     <entry>OID of the database that the foreign transaction resolver process connects to</entry>
+    </row>
+    <row>
+     <entry><structfield>last_resolved_time</structfield></entry>
+     <entry><type>timestamp with time zone</type></entry>
+     <entry>Time at which the resolver last resolved a foreign transaction</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_fdw_xact_resolver</structname> view will contain one
+   row per foreign transaction resolver process, showing the state of resolution
+   of foreign transactions.
+  </para>
 
   <table id="pg-stat-archiver-view" xreflabel="pg_stat_archiver">
    <title><structname>pg_stat_archiver</structname> View</title>
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..742e825 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -8,9 +8,9 @@ subdir = src/backend/access/rmgrdesc
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
-	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
-	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o fdwxactdesc.o \
+	genericdesc.o  gindesc.o gistdesc.o hashdesc.o heapdesc.o \
+	logicalmsgdesc.o mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o \
+	seqdesc.o smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/fdwxactdesc.c b/src/backend/access/rmgrdesc/fdwxactdesc.c
new file mode 100644
index 0000000..d543c8a
--- /dev/null
+++ b/src/backend/access/rmgrdesc/fdwxactdesc.c
@@ -0,0 +1,66 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxactdesc.c
+ *		PostgreSQL distributed transaction manager for foreign servers.
+ *
+ * This module describes the WAL records for the foreign transaction manager.
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/backend/access/rmgrdesc/fdwxactdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/fdwxact_xlog.h"
+
+void
+fdw_xact_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+	{
+		FdwXactOnDiskData *fdw_insert_xlog = (FdwXactOnDiskData *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_insert_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_insert_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_insert_xlog->dboid);
+		appendStringInfo(buf, " umid: %u", fdw_insert_xlog->umid);
+		appendStringInfo(buf, " local xid: %u", fdw_insert_xlog->local_xid);
+		/* TODO: This should be really interpreted by each FDW */
+
+		/*
+		 * TODO: we also need to assess whether we want to add this
+		 * information
+		 */
+		appendStringInfo(buf, " foreign transaction info: %s",
+						 fdw_insert_xlog->fdw_xact_id);
+	}
+	else
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		appendStringInfo(buf, "Foreign server oid: %u", fdw_remove_xlog->serverid);
+		appendStringInfo(buf, " user oid: %u", fdw_remove_xlog->userid);
+		appendStringInfo(buf, " database id: %u", fdw_remove_xlog->dbid);
+		appendStringInfo(buf, " local xid: %u", fdw_remove_xlog->xid);
+	}
+
+}
+
+const char *
+fdw_xact_identify(uint8 info)
+{
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_FDW_XACT_INSERT:
+			return "NEW FOREIGN TRANSACTION";
+		case XLOG_FDW_XACT_REMOVE:
+			return "REMOVE FOREIGN TRANSACTION";
+	}
+	/* Keep compiler happy */
+	return NULL;
+}
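
Both routines in fdwxactdesc.c above pick the record type with `info & ~XLR_INFO_MASK`, which clears the low four bits that the WAL machinery reserves and leaves the resource manager's own bits. A minimal standalone sketch of that masking follows. The two XLOG_FDW_XACT_* values are assumptions chosen for illustration (the real ones come from access/fdwxact_xlog.h, which is not shown here), and fdw_xact_identify_sketch is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define XLR_INFO_MASK			0x0F	/* low bits reserved for WAL internals */
#define XLOG_FDW_XACT_INSERT	0x00	/* assumed value, for illustration only */
#define XLOG_FDW_XACT_REMOVE	0x10	/* assumed value, for illustration only */

/* Map the rmgr-specific high bits of the info byte to a record name. */
const char *
fdw_xact_identify_sketch(uint8_t info)
{
	switch (info & ~XLR_INFO_MASK)
	{
		case XLOG_FDW_XACT_INSERT:
			return "NEW FOREIGN TRANSACTION";
		case XLOG_FDW_XACT_REMOVE:
			return "REMOVE FOREIGN TRANSACTION";
	}
	/* Unknown record type */
	return NULL;
}
```

Because the low bits are masked off before the comparison, an info byte such as 0x1A still identifies as a remove record under these assumed values. Note that fdw_xact_desc() in the patch treats every record that is not an insert as a remove, so an unknown info value would be decoded as FdwRemoveXlogRec there.
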
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 00741c7..023a7c5 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -112,14 +112,16 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 		appendStringInfo(buf, "max_connections=%d max_worker_processes=%d "
 						 "max_prepared_xacts=%d max_locks_per_xact=%d "
 						 "wal_level=%s wal_log_hints=%s "
-						 "track_commit_timestamp=%s",
+						 "track_commit_timestamp=%s "
+						 "max_prepared_foreign_xacts=%d",
 						 xlrec.MaxConnections,
 						 xlrec.max_worker_processes,
 						 xlrec.max_prepared_xacts,
 						 xlrec.max_locks_per_xact,
 						 wal_level_str,
 						 xlrec.wal_log_hints ? "on" : "off",
-						 xlrec.track_commit_timestamp ? "on" : "off");
+						 xlrec.track_commit_timestamp ? "on" : "off",
+						 xlrec.max_prepared_foreign_xacts);
 	}
 	else if (info == XLOG_FPW_CHANGE)
 	{
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47..90d0056 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -12,9 +12,9 @@ subdir = src/backend/access/transam
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
-	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
-	xact.o xlog.o xlogarchive.o xlogfuncs.o \
+OBJS = clog.o commit_ts.o fdwxact.o fdwxact_resolver.o generic_xlog.o multixact.o \
+	parallel.o rmgr.o slru.o subtrans.o timeline.o transam.o twophase.o \
+	twophase_rmgr.o varsup.o xact.o xlog.o xlogarchive.o xlogfuncs.o \
 	xloginsert.o xlogreader.o xlogutils.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/fdwxact.c b/src/backend/access/transam/fdwxact.c
new file mode 100755
index 0000000..dbdbcd7
--- /dev/null
+++ b/src/backend/access/transam/fdwxact.c
@@ -0,0 +1,2513 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact.c
+ *		PostgreSQL distributed transaction manager for foreign servers.
+ *
+ * This module manages the transactions involving foreign servers.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/backend/access/transam/fdwxact.c
+ *
+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.
+ *
+ * When a foreign data wrapper starts a transaction on a foreign server, it is
+ * required to register the foreign server and the user who initiated the
+ * transaction using FdwXactRegisterForeignServer(). A foreign server
+ * connection is identified by the OIDs of the foreign server and the user.
+ *
+ * The commit is executed in two phases. In the first phase, executed during
+ * the pre-commit phase, transactions are prepared on all the foreign servers
+ * that can participate in the two-phase commit protocol. Transactions on other
+ * foreign servers are committed in the same phase. In the second phase, if the
+ * first phase does not succeed for whatever reason, the foreign servers
+ * are asked to roll back their prepared transactions, or to abort the
+ * transactions if they are not prepared. This is done by the backend
+ * process that executed the first phase. If the first phase succeeds, the
+ * backend process adds itself to the queue in shared memory and then
+ * asks the foreign transaction resolver process to resolve the foreign
+ * transactions associated with its transaction. Once the resolver process
+ * has resolved all of those foreign transactions, the backend wakes up
+ * and resumes processing.
+ *
+ * Any network failure or server crash after preparing a foreign transaction
+ * leaves that prepared transaction unresolved (a dangling transaction). During
+ * the first phase, before actually preparing the transactions, enough
+ * information is persisted to disk and logged to resolve such transactions.
+ *
+ * During WAL replay and replication, FdwXactCtl also holds information about
+ * active prepared foreign transactions that haven't been moved to disk yet.
+ *
+ * Replay of fdwxact records happens by the following rules:
+ *
+ * 	* On PREPARE redo we add the foreign transaction to FdwXactCtl->fdw_xacts.
+ *	  We set fdw_xact->inredo to true for such entries.
+ *	* On Checkpoint redo we iterate through FdwXactCtl->fdw_xacts entries
+ *	  that have fdw_xact->inredo set and are behind the redo_horizon.
+ *	  We save them to disk and also set fdw_xact->ondisk to true.
+ *	* On COMMIT and ABORT we delete the entry from FdwXactCtl->fdw_xacts.
+ *	  If fdw_xact->ondisk is true, we delete the corresponding entry from
+ *	  the disk as well.
+ *	* RecoverPreparedTransactions() and StandbyRecoverPreparedTransactions()
+ *	  have been modified to go through fdw_xact->inredo entries that have
+ *	  not made it to disk yet.
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "funcapi.h"
+
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/fdwxact_xlog.h"
+#include "access/htup_details.h"
+#include "access/twophase.h"
+#include "access/resolver_private.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "access/xlogutils.h"
+#include "catalog/pg_type.h"
+#include "foreign/foreign.h"
+#include "foreign/fdwapi.h"
+#include "libpq/pqsignal.h"
+#include "pg_trace.h"
+#include "pgstat.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/latch.h"
+#include "storage/lock.h"
+#include "storage/proc.h"
+#include "storage/procarray.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/memutils.h"
+#include "utils/guc.h"
+#include "utils/ps_status.h"
+#include "utils/snapmgr.h"
+
+/* Structure to bundle the foreign connection participating in transaction */
+typedef struct
+{
+	Oid			serverid;
+	Oid			userid;
+	Oid			umid;
+	char	   *servername;
+	FdwXact		fdw_xact;		/* foreign prepared transaction entry in case
+								 * prepared */
+	bool		two_phase_commit;		/* Should use two phase commit
+										 * protocol while committing
+										 * transaction on this server,
+										 * whenever necessary. */
+	bool		modified;		/* modified on foreign server in the transaction */
+	GetPrepareId_function get_prepare_id;
+	EndForeignTransaction_function end_foreign_xact;
+	PrepareForeignTransaction_function prepare_foreign_xact;
+	ResolvePreparedForeignTransaction_function resolve_prepared_foreign_xact;
+}	FdwConnection;
+
+/* List of foreign connections participating in the transaction */
+List	   *MyFdwConnections = NIL;
+
+/* Shmem hash entry */
+typedef struct
+{
+	/* tag */
+	TransactionId	xid;
+
+	/* List of FdwXacts associated with the xid */
+	FdwXact	first_fx;
+} FdwXactHashEntry;
+
+static HTAB	*FdwXactHash;
+
+/*
+ * By default we assume that all the foreign connections participating in this
+ * transaction can use two phase commit protocol.
+ */
+bool		TwoPhaseReady = true;
+
+/* Directory where the foreign prepared transaction files will reside */
+#define FDW_XACTS_DIR "pg_fdw_xact"
+
+/*
+ * Name of foreign prepared transaction file is 8 hex digits each of xid,
+ * foreign server oid and user oid, separated by '_'.
+ */
+#define FDW_XACT_FILE_NAME_LEN (8 + 1 + 8 + 1 + 8)
+#define FdwXactFilePath(path, xid, serverid, userid)	\
+	snprintf(path, MAXPGPATH, FDW_XACTS_DIR "/%08X_%08X_%08X", xid, \
+			 serverid, userid)
+
+static FdwXact FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+				  Oid umid, char *fdw_xact_info);
+static void FdwXactPrepareForeignTransactions(void);
+static void AtProcExit_FdwXact(int code, Datum arg);
+static bool FdwXactResolveForeignTransaction(FdwXact fdw_xact,
+											 ResolvePreparedForeignTransaction_function prepared_foreign_xact_resolver);
+static void UnlockMyFdwXacts(void);
+static void remove_fdw_xact(FdwXact fdw_xact);
+static FdwXact insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid,
+							   Oid umid, char *fdw_xact_id);
+static int	GetFdwXactList(FdwXact * fdw_xacts);
+static ResolvePreparedForeignTransaction_function get_prepared_foreign_xact_resolver(FdwXact fdw_xact);
+static FdwXactOnDiskData *ReadFdwXactFile(TransactionId xid, Oid serverid,
+				Oid userid);
+static void RemoveFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+				  bool giveWarning);
+static void RecreateFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len);
+static void XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len);
+static FdwXact get_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
+static bool search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+							List **qualifying_xacts);
+
+static void FdwXactQueueInsert(void);
+static void FdwXactCancelWait(void);
+
+/* Guc parameters */
+int			max_prepared_foreign_xacts = 0;
+int			max_foreign_xact_resolvers = 0;
+
+/* Keep track of registering process exit call back. */
+static bool fdwXactExitRegistered = false;
+
+/* Foreign transaction entries locked by this backend */
+List	   *MyLockedFdwXacts = NIL;
+FdwXactResolver *MyFdwXactResolver = NULL;
+
+/* Record the server, userid participating in the transaction. */
+void
+FdwXactRegisterForeignServer(Oid serverid, Oid userid, bool two_phase_commit,
+							 bool modify)
+{
+	FdwConnection *fdw_conn;
+	ListCell   *lcell;
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	UserMapping *user_mapping;
+	FdwRoutine *fdw_routine;
+	MemoryContext old_context;
+
+	TwoPhaseReady = TwoPhaseReady && two_phase_commit;
+
+	/* Check whether this server and user are already registered */
+	foreach(lcell, MyFdwConnections)
+	{
+		fdw_conn = lfirst(lcell);
+
+		/* Already registered: just remember whether it modifies data */
+		if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+		{
+			fdw_conn->modified |= modify;
+			return;
+		}
+	}
+
+	/*
+	 * This list and its contents need to be saved in the transaction context
+	 * memory
+	 */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	/* Add this foreign connection to the list for transaction management */
+	fdw_conn = (FdwConnection *) palloc(sizeof(FdwConnection));
+
+	/* Make sure that the FDW has at least a transaction handler */
+	foreign_server = GetForeignServer(serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	user_mapping = GetUserMapping(userid, serverid);
+
+	if (!fdw_routine->EndForeignTransaction)
+		ereport(ERROR,
+				(errmsg("no function to end a foreign transaction provided for FDW %s",
+						fdw->fdwname)));
+
+	if (two_phase_commit)
+	{
+		if (max_prepared_foreign_xacts == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_prepared_foreign_transactions to a nonzero value.")));
+
+		if (max_foreign_xact_resolvers == 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("prepared foreign transactions are disabled"),
+					 errhint("Set max_foreign_xact_resolvers to a nonzero value.")));
+
+		if (!fdw_routine->PrepareForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for preparing foreign transaction for FDW %s",
+							fdw->fdwname)));
+
+		if (!fdw_routine->ResolvePreparedForeignTransaction)
+			ereport(ERROR,
+					(errmsg("no function provided for resolving prepared foreign transaction for FDW %s",
+							fdw->fdwname)));
+	}
+
+	fdw_conn->serverid = serverid;
+	fdw_conn->userid = userid;
+	fdw_conn->umid = user_mapping->umid;
+
+	/*
+	 * We may need the following information at the end of a transaction, when
+	 * the system caches are not available. So save it beforehand.
+	 */
+	fdw_conn->servername = foreign_server->servername;
+	fdw_conn->get_prepare_id = fdw_routine->GetPrepareId;
+	fdw_conn->prepare_foreign_xact = fdw_routine->PrepareForeignTransaction;
+	fdw_conn->resolve_prepared_foreign_xact = fdw_routine->ResolvePreparedForeignTransaction;
+	fdw_conn->end_foreign_xact = fdw_routine->EndForeignTransaction;
+	fdw_conn->fdw_xact = NULL;
+	fdw_conn->modified = modify;
+	fdw_conn->two_phase_commit = two_phase_commit;
+	MyFdwConnections = lappend(MyFdwConnections, fdw_conn);
+	/* Restore the previous memory context */
+	MemoryContextSwitchTo(old_context);
+
+	return;
+}
+
+/*
+ * FdwXactShmemSize
+ * Calculates the size of shared memory allocated for maintaining foreign
+ * prepared transaction entries.
+ */
+Size
+FdwXactShmemSize(void)
+{
+	Size		size;
+
+	/* Need the fixed struct, foreign transaction information array */
+	size = offsetof(FdwXactCtlData, fdw_xacts);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FdwXact)));
+	size = MAXALIGN(size);
+	size = add_size(size, mul_size(max_prepared_foreign_xacts,
+								   sizeof(FdwXactData)));
+
+	size = MAXALIGN(size);
+	size = add_size(size, hash_estimate_size(max_prepared_foreign_xacts,
+											 sizeof(FdwXactHashEntry)));
+
+	return size;
+}
+
+/*
+ * FdwXactShmemInit
+ * Initialization of shared memory for maintaining foreign prepared transaction
+ * entries. The shared memory layout is defined in definition of
+ * FdwXactCtlData structure.
+ */
+void
+FdwXactShmemInit(void)
+{
+	bool		found;
+
+	FdwXactCtl = ShmemInitStruct("Foreign transactions table",
+								 FdwXactShmemSize(),
+								 &found);
+	if (!IsUnderPostmaster)
+	{
+		FdwXact		fdw_xacts;
+		HASHCTL		info;
+		long		init_hash_size;
+		long		max_hash_size;
+		int			cnt;
+
+		Assert(!found);
+		FdwXactCtl->freeFdwXacts = NULL;
+		FdwXactCtl->numFdwXacts = 0;
+
+		/* Initialise the linked list of free FDW transactions */
+		fdw_xacts = (FdwXact)
+			((char *) FdwXactCtl +
+			 MAXALIGN(offsetof(FdwXactCtlData, fdw_xacts) +
+					  sizeof(FdwXact) * max_prepared_foreign_xacts));
+		for (cnt = 0; cnt < max_prepared_foreign_xacts; cnt++)
+		{
+			fdw_xacts[cnt].fx_free_next = FdwXactCtl->freeFdwXacts;
+			FdwXactCtl->freeFdwXacts = &fdw_xacts[cnt];
+		}
+
+		MemSet(&info, 0, sizeof(info));
+		info.keysize = sizeof(TransactionId);
+		info.entrysize = sizeof(FdwXactHashEntry);
+
+		max_hash_size = max_prepared_foreign_xacts;
+		init_hash_size = max_hash_size / 2;
+
+		FdwXactHash = ShmemInitHash("FdwXact hash",
+									init_hash_size,
+									max_hash_size,
+									&info,
+									HASH_ELEM | HASH_BLOBS);
+	}
+	else
+	{
+		Assert(FdwXactCtl);
+		Assert(found);
+	}
+}
+
+
+/*
+ * PreCommit_FdwXacts
+ *
+ * The function is responsible for pre-commit processing on foreign connections.
+ * Foreign transactions are prepared on the foreign servers that can execute
+ * the two-phase-commit protocol. On the rest of the foreign servers, we commit
+ * the transaction here.
+ *
+ * If only one modified foreign server capable of two-phase commit remains and
+ * the transaction wrote no local non-temporary relation, the two-phase-commit
+ * protocol is not needed and that server is committed directly; see
+ * TwoPhaseCommitRequired().
+ */
+void
+PreCommit_FdwXacts(void)
+{
+	ListCell   *cur;
+	ListCell   *prev;
+	ListCell   *next;
+
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFdwConnections) < 1)
+		return;
+
+	/*
+	 * Try committing transactions on the foreign servers, which can not
+	 * execute two-phase-commit protocol.
+	 */
+	for (cur = list_head(MyFdwConnections), prev = NULL; cur; cur = next)
+	{
+		FdwConnection *fdw_conn = lfirst(cur);
+
+		next = lnext(cur);
+
+		/*
+		 * Commit the foreign transactions on servers that either cannot
+		 * execute the two-phase-commit protocol or were not modified.
+		 */
+		if (!fdw_conn->two_phase_commit || !fdw_conn->modified)
+		{
+			/*
+			 * The FDW has to make sure that the connection opened to the
+			 * foreign server is out of transaction. Even if the handler
+			 * function returns a failure status, there's hardly anything we
+			 * can do.
+			 */
+			if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, true))
+				elog(WARNING, "could not commit transaction on server %s",
+					 fdw_conn->servername);
+
+			/* The connection is no longer part of this transaction, forget it */
+			MyFdwConnections = list_delete_cell(MyFdwConnections, cur, prev);
+		}
+		else
+			prev = cur;
+	}
+
+	/* Return if we've committed all foreign servers */
+	if (list_length(MyFdwConnections) == 0)
+		return;
+
+	/*
+	 * Here, we have committed foreign transactions on foreign servers that can
+	 * not or don't need to execute two-phase-commit protocol and MyFdwConnections
+	 * has only foreign servers that need to execute two-phase-commit protocol.
+	 */
+	if (TwoPhaseCommitRequired())
+	{
+		/*
+		 * Prepare the transactions on all the foreign servers that can
+		 * execute the two-phase-commit protocol.
+		 */
+		FdwXactPrepareForeignTransactions();
+	}
+	else
+	{
+		FdwConnection *fdw_conn;
+
+		Assert(list_length(MyFdwConnections) == 1);
+		fdw_conn = lfirst(list_head(MyFdwConnections));
+
+		/*
+		 * We don't need to use the two-phase commit protocol if there is only
+		 * one server that can execute the two-phase-commit protocol.
+		 */
+		if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+										fdw_conn->umid, true))
+			elog(WARNING, "could not commit transaction on server %s",
+				 fdw_conn->servername);
+
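+		/*
+		 * Exactly one connection is left, so dropping the head empties the
+		 * list. The loop above has already run "cur" off the end of the list,
+		 * so don't reuse it here.
+		 */
+		MyFdwConnections = list_delete_first(MyFdwConnections);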
+	}
+}
+
+/*
+ * FdwXactPrepareForeignTransactions
+ *
+ * Prepare transactions on the foreign servers which can execute two phase
+ * commit protocol. Rest of the foreign servers are ignored.
+ */
+static void
+FdwXactPrepareForeignTransactions(void)
+{
+	ListCell   *lcell;
+	FdwXact		prev_fdwxact = NULL;
+
+	/*
+	 * Loop over the foreign connections
+	 */
+	foreach(lcell, MyFdwConnections)
+	{
+		FdwConnection *fdw_conn = (FdwConnection *) lfirst(lcell);
+		char	    *fdw_xact_id;
+		int			fdw_xact_id_len;
+		FdwXact		fdw_xact;
+
+		if (!fdw_conn->two_phase_commit || !fdw_conn->modified)
+			continue;
+
+
+		/* Generate prepare transaction id for foreign server */
+		Assert(fdw_conn->get_prepare_id);
+		fdw_xact_id = fdw_conn->get_prepare_id(fdw_conn->serverid,
+											   fdw_conn->userid,
+											   &fdw_xact_id_len);
+
+		/*
+		 * Register the foreign transaction with the identifier used to
+		 * prepare it on the foreign server. Registration persists this
+		 * information to disk and to WAL (which also relays it to standbys).
+		 * Thus, if we lose connectivity to the foreign server or crash
+		 * ourselves, we will remember that we have a prepared transaction on
+		 * the foreign server and try to resolve it when connectivity is
+		 * restored or after crash recovery.
+		 *
+		 * If we crash after persisting the information but before preparing
+		 * the transaction on the foreign server, we will try to resolve a
+		 * never-prepared transaction, and get an error. This is fine as long
+		 * as the FDW provides us unique prepared transaction identifiers.
+		 *
+		 * If we prepare the transaction on the foreign server before
+		 * persisting the information to the disk and crash in-between these
+		 * two steps, we will forget that we prepared the transaction on the
+		 * foreign server and will not be able to resolve it after the crash.
+		 * Hence persist first then prepare.
+		 */
+		fdw_xact = FdwXactRegisterFdwXact(MyDatabaseId, GetTopTransactionId(),
+									 fdw_conn->serverid, fdw_conn->userid,
+									 fdw_conn->umid, fdw_xact_id);
+
+		/*
+		 * Between the FdwXactRegisterFdwXact call above and the moment this
+		 * backend hears back from the foreign server, the backend may abort
+		 * the local transaction (say, because of a signal). During abort
+		 * processing, it will send an ABORT message to the foreign server. If
+		 * the foreign server has not prepared the transaction, the message
+		 * will succeed. If the foreign server has prepared the transaction,
+		 * it will throw an error, which we will ignore, and the prepared
+		 * foreign transaction will be resolved by a foreign transaction
+		 * resolver.
+		 */
+		if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+											fdw_conn->umid, fdw_xact_id))
+		{
+			StringInfo servername;
+			/*
+			 * An error occurred, and we didn't prepare the transaction.
+			 * Delete the entry from foreign transaction table. Raise an
+			 * error, so that the local server knows that one of the foreign
+			 * servers has failed to prepare the transaction.
+			 *
+			 * XXX : FDW is expected to print the error as a warning and then
+			 * we raise actual error here. But instead, we should pull the
+			 * error text from FDW and add it here in the message or as a
+			 * context or a hint.
+			 */
+			remove_fdw_xact(fdw_xact);
+
+			/*
+			 * Delete the connection, since it doesn't require any further
+			 * processing. This deletion will invalidate current cell pointer,
+			 * but that is fine since we will not use that pointer because the
+			 * subsequent ereport will get us out of this loop.
+			 */
+			servername = makeStringInfo();
+			appendStringInfoString(servername, fdw_conn->servername);
+			MyFdwConnections = list_delete_ptr(MyFdwConnections, fdw_conn);
+			ereport(ERROR,
+					(errmsg("can not prepare transaction on foreign server %s",
+							servername->data)));
+		}
+
+		/* Prepare succeeded, remember it in the connection */
+		fdw_conn->fdw_xact = fdw_xact;
+
+		/*
+		 * If this is the first fdwxact entry we keep it in the hash table for
+		 * the later use.
+		 */
+		if (!prev_fdwxact)
+		{
+			FdwXactHashEntry	*fdwxact_entry;
+			bool				found;
+			TransactionId		key;
+
+			key = fdw_xact->local_xid;
+
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			fdwxact_entry = (FdwXactHashEntry *) hash_search(FdwXactHash,
+															 &key,
+															 HASH_ENTER, &found);
+			Assert(!found);
+
+			/* Fill in the shared entry before releasing the lock */
+			fdwxact_entry->first_fx = fdw_xact;
+			LWLockRelease(FdwXactLock);
+		}
+		else
+		{
+			/*
+			 * Make a list of fdwxacts that are associated with the
+			 * same local transaction.
+			 */
+			Assert(fdw_xact->fx_next == NULL);
+			prev_fdwxact->fx_next = fdw_xact;
+		}
+
+		prev_fdwxact = fdw_xact;
+	}
+
+	return;
+}
+
+/*
+ * FdwXactRegisterFdwXact
+ *
+ * This function creates a new foreign transaction entry before an FDW
+ * executes the first phase of two-phase commit. The function adds the entry to
+ * WAL; it is persisted to disk under the pg_fdw_xact directory at checkpoint.
+ */
+static FdwXact
+FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,
+					   Oid umid, char *fdw_xact_id)
+{
+	FdwXact		fdw_xact;
+	FdwXactOnDiskData *fdw_xact_file_data;
+	MemoryContext	old_context;
+	int			data_len;
+
+	/* Enter the foreign transaction in the shared memory structure */
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(dbid, xid, serverid, userid, umid, fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->locking_backend = MyBackendId;
+
+	LWLockRelease(FdwXactLock);
+
+	/* Remember that we have locked this entry. */
+	old_context = MemoryContextSwitchTo(TopTransactionContext);
+	MyLockedFdwXacts = lappend(MyLockedFdwXacts, fdw_xact);
+	MemoryContextSwitchTo(old_context);
+
+	/*
+	 * Prepare to write the entry to a file. Also add xlog entry. The contents
+	 * of the xlog record are same as what is written to the file.
+	 */
+	data_len = offsetof(FdwXactOnDiskData, fdw_xact_id);
+	data_len = data_len + FDW_XACT_ID_LEN;
+	data_len = MAXALIGN(data_len);
+	fdw_xact_file_data = (FdwXactOnDiskData *) palloc0(data_len);
+	fdw_xact_file_data->dboid = fdw_xact->dboid;
+	fdw_xact_file_data->local_xid = fdw_xact->local_xid;
+	fdw_xact_file_data->serverid = fdw_xact->serverid;
+	fdw_xact_file_data->userid = fdw_xact->userid;
+	fdw_xact_file_data->umid = fdw_xact->umid;
+	memcpy(fdw_xact_file_data->fdw_xact_id, fdw_xact->fdw_xact_id,
+		   FDW_XACT_ID_LEN);
+
+	START_CRIT_SECTION();
+
+	/* Add the entry in the xlog and save LSN for checkpointer */
+	XLogBeginInsert();
+	XLogRegisterData((char *) fdw_xact_file_data, data_len);
+	fdw_xact->fdw_xact_end_lsn = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_INSERT);
+	XLogFlush(fdw_xact->fdw_xact_end_lsn);
+
+	/* Store record's start location to read that later on CheckPoint */
+	fdw_xact->fdw_xact_start_lsn = ProcLastRecPtr;
+
+	/* The WAL record is flushed, so the checkpointer can sync this entry */
+	fdw_xact->valid = true;
+
+	END_CRIT_SECTION();
+
+	pfree(fdw_xact_file_data);
+	return fdw_xact;
+}
+
+/*
+ * insert_fdw_xact
+ *
+ * Insert a new entry for a given foreign transaction identified by transaction
+ * id, foreign server and user mapping, in the shared memory. Caller must hold
+ * FdwXactLock in exclusive mode.
+ *
+ * If the entry already exists, the function raises an error.
+ */
+static FdwXact
+insert_fdw_xact(Oid dboid, TransactionId xid, Oid serverid, Oid userid, Oid umid,
+				char *fdw_xact_id)
+{
+	int i;
+	FdwXact fdw_xact;
+
+	if (!fdwXactExitRegistered)
+	{
+		before_shmem_exit(AtProcExit_FdwXact, 0);
+		fdwXactExitRegistered = true;
+	}
+
+	/* Check for a duplicate foreign transaction entry */
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		fdw_xact = FdwXactCtl->fdw_xacts[i];
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+			elog(ERROR, "duplicate entry for foreign transaction with transaction id %u, serverid %u, userid %u found",
+				 xid, serverid, userid);
+	}
+
+	/*
+	 * Get a next free foreign transaction entry. Raise error if there are
+	 * none left.
+	 */
+	if (!FdwXactCtl->freeFdwXacts)
+	{
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("maximum number of foreign transactions reached"),
+				 errhint("Increase max_prepared_foreign_transactions (currently %d).",
+						 max_prepared_foreign_xacts)));
+	}
+
+	fdw_xact = FdwXactCtl->freeFdwXacts;
+	FdwXactCtl->freeFdwXacts = fdw_xact->fx_free_next;
+
+	/* Insert the entry to active array */
+	Assert(FdwXactCtl->numFdwXacts < max_prepared_foreign_xacts);
+	FdwXactCtl->fdw_xacts[FdwXactCtl->numFdwXacts++] = fdw_xact;
+
+	/* Initialize the entry; the caller stamps locking_backend under the lock */
+	fdw_xact->locking_backend = InvalidBackendId;
+	fdw_xact->dboid = dboid;
+	fdw_xact->local_xid = xid;
+	fdw_xact->serverid = serverid;
+	fdw_xact->userid = userid;
+	fdw_xact->umid = umid;
+	fdw_xact->fdw_xact_start_lsn = InvalidXLogRecPtr;
+	fdw_xact->fdw_xact_end_lsn = InvalidXLogRecPtr;
+	fdw_xact->valid = false;
+	fdw_xact->ondisk = false;
+	fdw_xact->inredo = false;
+	memcpy(fdw_xact->fdw_xact_id, fdw_xact_id, FDW_XACT_ID_LEN);
+
+	return fdw_xact;
+}
+
+/*
+ * remove_fdw_xact
+ *
+ * Removes the foreign prepared transaction entry from shared memory and from
+ * disk, and logs the removal in WAL.
+ */
+static void
+remove_fdw_xact(FdwXact fdw_xact)
+{
+	int			cnt;
+
+	Assert(fdw_xact != NULL);
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Search the slot where this entry resided */
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		if (FdwXactCtl->fdw_xacts[cnt] == fdw_xact)
+		{
+			/* Remove the entry from active array */
+			FdwXactCtl->numFdwXacts--;
+			FdwXactCtl->fdw_xacts[cnt] = FdwXactCtl->fdw_xacts[FdwXactCtl->numFdwXacts];
+
+			/* Put it back into free list */
+			fdw_xact->fx_free_next = FdwXactCtl->freeFdwXacts;
+			FdwXactCtl->freeFdwXacts = fdw_xact;
+
+			/* Unlock the entry */
+			fdw_xact->locking_backend = InvalidBackendId;
+			fdw_xact->fx_next = NULL;
+			MyLockedFdwXacts = list_delete_ptr(MyLockedFdwXacts, fdw_xact);
+
+			LWLockRelease(FdwXactLock);
+
+			if (!RecoveryInProgress())
+			{
+				FdwRemoveXlogRec fdw_remove_xlog;
+				XLogRecPtr	recptr;
+
+				/* Fill up the log record before releasing the entry */
+				fdw_remove_xlog.serverid = fdw_xact->serverid;
+				fdw_remove_xlog.dbid = fdw_xact->dboid;
+				fdw_remove_xlog.xid = fdw_xact->local_xid;
+				fdw_remove_xlog.userid = fdw_xact->userid;
+
+				START_CRIT_SECTION();
+
+				/*
+				 * Log that we are removing the foreign transaction entry and
+				 * remove the file from the disk as well.
+				 */
+				XLogBeginInsert();
+				XLogRegisterData((char *) &fdw_remove_xlog, sizeof(fdw_remove_xlog));
+				recptr = XLogInsert(RM_FDW_XACT_ID, XLOG_FDW_XACT_REMOVE);
+				XLogFlush(recptr);
+
+				END_CRIT_SECTION();
+			}
+
+			/* Remove the file from the disk if exists. */
+			if (fdw_xact->ondisk)
+				RemoveFdwXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								  fdw_xact->userid, true);
+			return;
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	/* We did not find the given entry in global array */
+	elog(ERROR, "failed to find foreign transaction entry for xid %u, foreign server %u, and user %u",
+		 fdw_xact->local_xid, fdw_xact->serverid, fdw_xact->userid);
+}
+
+/*
+ * We need to use the two-phase commit protocol in two cases: we modified data
+ * on more than one foreign server, or we modified data on one foreign server
+ * and on a local non-temporary relation.
+ */
+bool
+TwoPhaseCommitRequired(void)
+{
+	if ((list_length(MyFdwConnections) > 1) ||
+		(list_length(MyFdwConnections) == 1 && (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL)))
+		return true;
+
+	return false;
+}
+
+/*
+ * UnlockMyFdwXacts
+ *
+ * Unlock the foreign transaction entries locked by this backend and remove
+ * them from the backend's list of locked foreign transactions.
+ */
+static void
+UnlockMyFdwXacts(void)
+{
+	ListCell *cell;
+	ListCell *next;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	for (cell = list_head(MyLockedFdwXacts); cell != NULL; cell = next)
+	{
+		FdwXact	fdwxact = (FdwXact) lfirst(cell);
+
+		next = lnext(cell);
+
+		/*
+		 * The resolver process removes a fdwxact entry once it is resolved,
+		 * which frees the slot for reuse. An entry still listed in
+		 * MyLockedFdwXacts may therefore have been locked by another backend
+		 * before we get here, so only unlock entries that are locked by
+		 * MyBackendId.
+		 */
+		if (fdwxact->locking_backend == MyBackendId)
+		{
+			/*
+			 * We set the locking backend as invalid, and then remove it from
+			 * the list of locked foreign transactions, under the LWLock. If we
+			 * reversed the order and the process exited in between, we would
+			 * leave an entry locked by this backend, which would get unlocked
+			 * only at server restart.
+			 */
+
+			fdwxact->locking_backend = InvalidBackendId;
+			MyLockedFdwXacts = list_delete_ptr(MyLockedFdwXacts, fdwxact);
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * AtProcExit_FdwXact
+ *
+ * When the process exits, unlock the entries it held.
+ */
+static void
+AtProcExit_FdwXact(int code, Datum arg)
+{
+	UnlockMyFdwXacts();
+}
+
+/*
+ * Wait for foreign transaction to be resolved.
+ *
+ * Initially a backend is in state FDW_XACT_NOT_WAITING; it changes that
+ * state to FDW_XACT_WAITING before adding itself to the wait queue.
+ * During FdwXactResolveForeignTransactions a fdwxact
+ * resolver changes the state to FDW_XACT_WAIT_COMPLETE once foreign
+ * transactions are resolved. This backend then resets its state
+ * to FDW_XACT_NOT_WAITING. If fdwxact_list is NULL, it means that
+ * we use the list of FdwXact just used, so set it to MyLockedFdwXacts.
+ *
+ * This function is inspired by SyncRepWaitForLSN.
+ */
+void
+FdwXactWaitToBeResolved(TransactionId wait_xid, bool is_commit)
+{
+	char		*new_status = NULL;
+	const char	*old_status;
+	ListCell	*cell;
+	List		*entries_to_resolve;
+
+	/*
+	 * Quick exit if user has not requested foreign transaction resolution
+	 * or there are no foreign servers that are modified in the current
+	 * transaction.
+	 */
+	if (!FdwXactEnabled())
+		return;
+
+	Assert(FdwXactCtl != NULL);
+	Assert(TransactionIdIsValid(wait_xid));
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	Assert(MyProc->fdwXactState == FDW_XACT_NOT_WAITING);
+
+	/*
+	 * Get the list of foreign transactions that are involved with the
+	 * given wait_xid.
+	 */
+	search_fdw_xact(wait_xid, MyDatabaseId, InvalidOid, InvalidOid,
+					&entries_to_resolve);
+
+	/* Quick exit if we found no foreign transaction to resolve */
+	if (list_length(entries_to_resolve) == 0)
+		return;
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+	/* Change status of fdw_xact entries according to is_commit */
+	foreach (cell, entries_to_resolve)
+	{
+		FdwXact fdw_xact = (FdwXact) lfirst(cell);
+
+		/* Don't overwrite status if fate is determined */
+		if (fdw_xact->status == FDW_XACT_PREPARING)
+			fdw_xact->status = (is_commit ?
+								FDW_XACT_COMMITTING_PREPARED :
+								FDW_XACT_ABORTING_PREPARED);
+	}
+
+	/* Set backend status and enqueue itself */
+	MyProc->fdwXactState = FDW_XACT_WAITING;
+	MyProc->fdwXactWaitXid = wait_xid;
+	FdwXactQueueInsert();
+	LWLockRelease(FdwXactLock);
+
+	/* Launch a resolver process if none is running yet, then wake it up */
+	fdwxact_maybe_launch_resolver();
+
+	/*
+	 * Alter ps display to show waiting for foreign transaction
+	 * resolution.
+	 */
+	if (update_process_title)
+	{
+		int len;
+
+		old_status = get_ps_display(&len);
+		/* " waiting for resolution " is 24 bytes, an xid at most 10 digits */
+		new_status = (char *) palloc(len + 34 + 1);
+		memcpy(new_status, old_status, len);
+		sprintf(new_status + len, " waiting for resolution %u", wait_xid);
+		set_ps_display(new_status, false);
+		new_status[len] = '\0';	/* truncate off "waiting ..." */
+	}
+
+	/* Wait for all foreign transactions to be resolved */
+	for (;;)
+	{
+		/* Must reset the latch before testing state */
+		ResetLatch(MyLatch);
+
+		/*
+		 * Acquiring the lock is not needed, the latch ensures proper
+		 * barriers. If it looks like we're done, we must really be done,
+		 * because once the resolver changes the state to
+		 * FDW_XACT_WAIT_COMPLETE, it will never update it again, so we can't
+		 * be seeing a stale value in that case.
+		 */
+		if (MyProc->fdwXactState == FDW_XACT_WAIT_COMPLETE)
+			break;
+
+		/*
+		 * If a wait for foreign transaction resolution is pending, we can
+		 * neither acknowledge the commit nor raise ERROR or FATAL.  The latter
+		 * would lead the client to believe that the distributed transaction
+		 * aborted, which is not true: it's already committed locally. The
+		 * former is no good either: the client has requested committing a
+		 * distributed transaction, and is entitled to assume that an
+		 * acknowledged commit is also committed on all foreign servers, which
+		 * might not be true. So in this case we issue a WARNING (which some
+		 * clients may be able to interpret) and shut off further output. We
+		 * do NOT reset ProcDiePending, so that the process will die after the
+		 * commit is cleaned up.
+		 */
+		if (ProcDiePending)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("canceling the wait for resolving foreign transaction and terminating connection due to administrator command"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If a query cancel interrupt arrives we just terminate the wait with
+		 * a suitable warning. The foreign transactions can be orphaned, but
+		 * the foreign xact resolver can pick them up and try to resolve them
+		 * later.
+		 */
+		if (QueryCancelPending)
+		{
+			QueryCancelPending = false;
+			ereport(WARNING,
+					(errmsg("canceling wait for resolving foreign transaction due to user request"),
+					 errdetail("The transaction has already committed locally, but might not have been committed on the foreign server.")));
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * If the postmaster dies, we'll probably never get an
+		 * acknowledgement, because all the resolver processes will exit. So
+		 * just bail out.
+		 */
+		if (!PostmasterIsAlive())
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			FdwXactCancelWait();
+			break;
+		}
+
+		/*
+		 * Wait on latch.  Any condition that should wake us up will set the
+		 * latch, so no need for timeout.
+		 */
+		WaitLatch(MyLatch, WL_LATCH_SET | WL_POSTMASTER_DEATH, -1,
+				  WAIT_EVENT_FDW_XACT_RESOLUTION);
+	}
+
+	pg_read_barrier();
+
+	Assert(SHMQueueIsDetached(&(MyProc->fdwXactLinks)));
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+
+	/*
+	 * Unlock the list of locked entries. Entries that could not be resolved
+	 * remain as dangling transactions.
+	 */
+	UnlockMyFdwXacts();
+	MyLockedFdwXacts = NIL;
+
+	if (new_status)
+	{
+		set_ps_display(new_status, false);
+		pfree(new_status);
+	}
+}
+
+/*
+ * Acquire FdwXactLock and cancel any wait currently in progress.
+ */
+static void
+FdwXactCancelWait(void)
+{
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+		SHMQueueDelete(&(MyProc->fdwXactLinks));
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+	LWLockRelease(FdwXactLock);
+}
+
+/*
+ * Insert MyProc into the tail of FdwXactQueue.
+ */
+static void
+FdwXactQueueInsert(void)
+{
+	SHMQueueInsertBefore(&(FdwXactRslvCtl->FdwXactQueue),
+						 &(MyProc->fdwXactLinks));
+}
+
+void
+FdwXactCleanupAtProcExit(void)
+{
+	if (!SHMQueueIsDetached(&(MyProc->fdwXactLinks)))
+	{
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+		SHMQueueDelete(&(MyProc->fdwXactLinks));
+		LWLockRelease(FdwXactLock);
+	}
+}
+
+/*
+ * Create and initialize an FdwXactResolveState which is used
+ * for resolution of foreign transactions.
+ */
+FdwXactResolveState *
+CreateFdwXactResolveState(void)
+{
+	FdwXactResolveState *fstate = palloc0(sizeof(FdwXactResolveState));
+
+	fstate->dbid = MyDatabaseId;
+	fstate->fdwxact = NULL;
+	fstate->waiter = NULL;
+
+	return fstate;
+}
+
+/*
+ * Resolve a distributed transaction on the given database and release a
+ * waiter. Return true if we resolved a distributed transaction, false if we
+ * didn't resolve any.
+ *
+ * We can release the waiter iff all of the foreign transactions are resolved.
+ * To keep consistency between concurrent distributed transactions, we must
+ * process them one by one in the order they were registered. We must not move
+ * on to the next distributed transaction even if we failed to resolve one of
+ * the foreign transactions. In this case we remember the failed foreign
+ * transactions and retry them next time.
+ */
+bool
+FdwXactResolveDistributedTransaction(FdwXactResolveState *fstate)
+{
+	volatile FdwXact	failed_to_resolve = NULL;
+	bool				resolved;
+
+	Assert(fstate->dbid == MyDatabaseId);
+
+	if (!fstate->fdwxact)
+	{
+		FdwXactHashEntry	*fdwxact_entry;
+		bool				found;
+
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+		if (!fstate->waiter)
+		{
+			PGPROC	*proc;
+
+			/*
+			 * Fetch the first waiter for our database, walking the queue
+			 * from its head and advancing past waiters of other databases.
+			 */
+			proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->FdwXactQueue),
+										   &(FdwXactRslvCtl->FdwXactQueue),
+										   offsetof(PGPROC, fdwXactLinks));
+			while (proc && proc->databaseId != fstate->dbid)
+				proc = (PGPROC *) SHMQueueNext(&(FdwXactRslvCtl->FdwXactQueue),
+											   &(proc->fdwXactLinks),
+											   offsetof(PGPROC, fdwXactLinks));
+
+			/* Return if no backend in the queue waits on this database */
+			if (!proc)
+			{
+				LWLockRelease(FdwXactLock);
+				return false;
+			}
+
+			Assert(TransactionIdIsValid(proc->fdwXactWaitXid));
+			fstate->waiter = proc;
+		}
+
+		/* Search fdwxact entry from the hash table by the local transaction id */
+		fdwxact_entry = (FdwXactHashEntry *)
+			hash_search(FdwXactHash, (void *) &(fstate->waiter->fdwXactWaitXid),
+						HASH_ENTER, &found);
+
+		if (found)
+			fstate->fdwxact = fdwxact_entry->first_fx;
+		else
+		{
+			int i;
+			FdwXact entries_to_resolve = NULL;
+			FdwXact prev_fx = NULL;
+
+			/*
+			 * The fdwxact entry doesn't exist in the hash table in case where
+			 * a prepared transaction is resolved after recovery. In this case,
+			 * we construct a list of fdw xact entries by scanning over the
+			 * FdwXactCtl->fdw_xacts list.
+			 */
+			fdwxact_entry->first_fx = NULL;
+
+			for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+			{
+				FdwXact fx = FdwXactCtl->fdw_xacts[i];
+
+				if (fx->dboid == fstate->dbid && fx->local_xid == fstate->waiter->fdwXactWaitXid)
+				{
+					if (!entries_to_resolve)
+						entries_to_resolve = fx;
+
+					/* Link from previous entry to this entry */
+					if (prev_fx)
+						prev_fx->fx_next = fx;
+
+					prev_fx = fx;
+				}
+			}
+
+			fstate->fdwxact = entries_to_resolve;
+		}
+
+		LWLockRelease(FdwXactLock);
+	}
+
+	/* Resolve all foreign transactions associated with the transactionid */
+	while (fstate->fdwxact)
+	{
+		volatile FdwXact fx_next = NULL;
+		volatile FdwXact fdwxact = fstate->fdwxact;
+
+		/*
+		 * Remember the next entry to resolve as the current entry will be
+		 * removed after resolved.
+		 */
+		fx_next = fdwxact->fx_next;
+
+		/* Resolve a foreign transaction */
+		if (!FdwXactResolveForeignTransaction(fdwxact,
+											  get_prepared_foreign_xact_resolver(fdwxact)))
+		{
+			CHECK_FOR_INTERRUPTS();
+
+			/*
+			 * If failed to resolve a foreign transaction then we append it
+			 * to the list to remember them.
+			 */
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+			if (failed_to_resolve == NULL)
+			{
+				fdwxact->fx_next = NULL;
+				failed_to_resolve = fdwxact;
+			}
+			else
+			{
+				FdwXact fx = failed_to_resolve;
+
+				/* Walk to the last entry and append this one after it */
+				while (fx->fx_next != NULL)
+					fx = fx->fx_next;
+				fdwxact->fx_next = NULL;
+				fx->fx_next = fdwxact;
+			}
+			LWLockRelease(FdwXactLock);
+
+			elog(DEBUG2, "failed to resolve foreign transaction xid %u, umid %u",
+				 fdwxact->local_xid, fdwxact->umid);
+		}
+		else
+			elog(DEBUG2, "resolved foreign transaction xid %u, umid %u",
+				 fdwxact->local_xid, fdwxact->umid);
+
+		fstate->fdwxact = fx_next;
+	}
+
+	resolved = (failed_to_resolve == NULL);
+
+	if (resolved)
+	{
+		LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+
+		/* Remove fdwxact entry from shmem hash table */
+		hash_search(FdwXactHash, (void *) &(fstate->waiter->fdwXactWaitXid),
+					HASH_REMOVE, NULL);
+
+		/*
+		 * Remove waiter from shmem queue, if not detached yet. If the waiter
+		 * canceled its wait before resolution, it is already detached.
+		 */
+		if (!SHMQueueIsDetached(&(fstate->waiter->fdwXactLinks)))
+		{
+			TransactionId	wait_xid = fstate->waiter->fdwXactWaitXid;
+
+			SHMQueueDelete(&(fstate->waiter->fdwXactLinks));
+
+			pg_write_barrier();
+
+			/* Set state to complete */
+			fstate->waiter->fdwXactState = FDW_XACT_WAIT_COMPLETE;
+
+			/* Wake up the waiter only when we have set state and removed from queue */
+			SetLatch(&(fstate->waiter->procLatch));
+
+			elog(DEBUG2, "released a proc xid %u", wait_xid);
+		}
+
+		LWLockRelease(FdwXactLock);
+
+		/* Reset resolution state */
+		fstate->waiter = NULL;
+		Assert(fstate->fdwxact == NULL);
+	}
+	else
+		/* Update the foreign transaction to resolve for next resolution cycle */
+		fstate->fdwxact = failed_to_resolve;
+
+	return resolved;
+}
+
+/*
+ * Resolve all dangling foreign transactions on the given database. For dangling
+ * transaction resolution, we don't care about the order of resolution because
+ * these entries are already out of order.
+ */
+bool
+FdwXactResolveDanglingTransactions(Oid dbid)
+{
+	List		*fxact_list = NIL;
+	ListCell	*cell;
+	int			n_resolved = 0;
+	int			i;
+
+	Assert(OidIsValid(dbid));
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	/*
+	 * Create a list of dangling transactions of which corresponding local
+	 * transaction is on the given database.
+	 */
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		FdwXact fxact = FdwXactCtl->fdw_xacts[i];
+
+		/*
+		 * Append the fdwxact entry on the given database to the list if
+		 * it's not locked by anyone and is not part of the prepared transaction.
+		 */
+		if (fxact->dboid == dbid &&
+			fxact->locking_backend == InvalidBackendId &&
+			!TwoPhaseExists(fxact->local_xid))
+			fxact_list = lappend(fxact_list, fxact);
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	/* There is no foreign transaction we need to resolve */
+	if (list_length(fxact_list) == 0)
+		return false;
+
+	foreach(cell, fxact_list)
+	{
+		FdwXact fdwxact = (FdwXact) lfirst(cell);
+
+		if (!FdwXactResolveForeignTransaction(fdwxact,
+											  get_prepared_foreign_xact_resolver(fdwxact)))
+		{
+			/*
+			 * If failed to resolve this foreign transaction we skip it in
+			 * this resolution cycle. Try to resolve again in next cycle.
+			 */
+			ereport(WARNING, (errmsg("could not resolve dangling foreign transaction for xid %u, foreign server %u and user %u",
+									 fdwxact->local_xid, fdwxact->serverid, fdwxact->userid)));
+			continue;
+		}
+
+		n_resolved++;
+	}
+
+	list_free(fxact_list);
+
+	elog(DEBUG2, "resolved %d dangling foreign xacts", n_resolved);
+
+	return (n_resolved > 0);
+}
+
+/*
+ * AtEOXact_FdwXacts
+ *
+ * Clean up foreign transaction state at the end of the local transaction.
+ */
+void
+AtEOXact_FdwXacts(bool is_commit)
+{
+	ListCell   *lcell;
+
+	/*
+	 * In the commit case, we have already committed the foreign transactions
+	 * on the servers that cannot execute the two-phase commit protocol, and
+	 * prepared the transactions on the servers that can, during the
+	 * pre-commit phase. The prepared transactions will be resolved by the
+	 * resolver process, so there is nothing to do for them here. In the abort
+	 * case, on the other hand, we might have prepared, or be in the middle of
+	 * preparing, some transactions on foreign servers, so we need to abort
+	 * the foreign transactions that are not prepared yet.
+	 */
+	if (!is_commit)
+	{
+		foreach (lcell, MyFdwConnections)
+		{
+			FdwConnection	*fdw_conn = lfirst(lcell);
+
+			/*
+			 * Since the prepared foreign transaction should have been
+			 * resolved we abort the remaining not-prepared foreign
+			 * transactions.
+			 */
+			if (!fdw_conn->fdw_xact)
+			{
+				bool ret;
+
+				ret = fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+												 fdw_conn->umid, is_commit);
+				if (!ret)
+					ereport(WARNING, (errmsg("could not abort transaction on server \"%s\"",
+											 fdw_conn->servername)));
+			}
+		}
+	}
+
+	/*
+	 * Unlock any locked foreign transactions. Another backend might later
+	 * lock an entry we had locked, but there is no reason for a foreign
+	 * transaction entry to stay locked after the transaction which locked it
+	 * has ended.
+	 */
+	UnlockMyFdwXacts();
+	MyLockedFdwXacts = NIL;
+
+	/*
+	 * Reset the list of registered connections. Since the memory for the list
+	 * and its nodes comes from transaction memory context, it will be freed
+	 * after this call.
+	 */
+	MyFdwConnections = NIL;
+
+	/* Set TwoPhaseReady to its default value */
+	TwoPhaseReady = true;
+}
+
+/*
+ * AtPrepare_FdwXacts
+ *
+ * The function is called while preparing a transaction. If there are foreign
+ * servers involved in the transaction, this function prepares transactions
+ * on those servers.
+ *
+ * Note that the transaction may abort after we have prepared the foreign
+ * transactions, so we cannot release MyLockedFdwXacts and MyFdwConnections
+ * here. They are released after the resolver process rolls the transactions
+ * back during abort, or in AtEOXact_FdwXacts().
+ */
+void
+AtPrepare_FdwXacts(void)
+{
+	/* If there are no foreign servers involved, we have no business here */
+	if (list_length(MyFdwConnections) < 1)
+		return;
+
+	/*
+	 * All foreign servers participating in a transaction to be prepared
+	 * should be two phase compliant.
+	 */
+	if (!TwoPhaseReady)
+		ereport(ERROR,
+				(errcode(ERRCODE_T_R_INTEGRITY_CONSTRAINT_VIOLATION),
+				 errmsg("cannot prepare the transaction because some foreign servers involved in the transaction cannot prepare it")));
+
+	/* Prepare transactions on participating foreign servers. */
+	FdwXactPrepareForeignTransactions();
+}
+
+/*
+ * get_prepared_foreign_xact_resolver
+ */
+static ResolvePreparedForeignTransaction_function
+get_prepared_foreign_xact_resolver(FdwXact fdw_xact)
+{
+	ForeignServer *foreign_server;
+	ForeignDataWrapper *fdw;
+	FdwRoutine *fdw_routine;
+
+	foreign_server = GetForeignServer(fdw_xact->serverid);
+	fdw = GetForeignDataWrapper(foreign_server->fdwid);
+	fdw_routine = GetFdwRoutine(fdw->fdwhandler);
+	if (!fdw_routine->ResolvePreparedForeignTransaction)
+		elog(ERROR, "no foreign transaction resolver routine provided for FDW %s",
+			 fdw->fdwname);
+
+	return fdw_routine->ResolvePreparedForeignTransaction;
+}
+
+/*
+ * FdwXactResolveForeignTransaction
+ *
+ * Resolve the foreign transaction using the foreign data wrapper's transaction
+ * handler routine. The foreign transaction can be a dangling transaction
+ * that is not locked by any backend.
+ *
+ * If the resolution is successful, remove the foreign transaction entry from
+ * the shared memory and also remove the corresponding on-disk file.
+ */
+static bool
+FdwXactResolveForeignTransaction(FdwXact fdw_xact,
+			   ResolvePreparedForeignTransaction_function fdw_xact_handler)
+{
+	bool		resolved;
+	bool		is_commit;
+
+	if (!(fdw_xact->status == FDW_XACT_COMMITTING_PREPARED ||
+		 fdw_xact->status == FDW_XACT_ABORTING_PREPARED))
+		elog(DEBUG1, "fdwxact status : %d", fdw_xact->status);
+
+	/*
+	 * Determine whether we commit or abort this foreign transaction.
+	 */
+	if (fdw_xact->status == FDW_XACT_COMMITTING_PREPARED)
+		is_commit = true;
+	else if (fdw_xact->status == FDW_XACT_ABORTING_PREPARED)
+		is_commit = false;
+
+	/*
+	 * If the local transaction is already committed, commit prepared
+	 * foreign transactions as well.
+	 */
+	else if (TransactionIdDidCommit(fdw_xact->local_xid))
+	{
+		fdw_xact->status = FDW_XACT_COMMITTING_PREPARED;
+		is_commit = true;
+	}
+
+	/*
+	 * If the local transaction is already aborted, abort prepared
+	 * foreign transactions as well.
+	 */
+	else if (TransactionIdDidAbort(fdw_xact->local_xid))
+	{
+		fdw_xact->status = FDW_XACT_ABORTING_PREPARED;
+		is_commit = false;
+	}
+
+	/*
+	 * The local transaction is not in progress but the foreign
+	 * transaction is not prepared on the foreign server. This
+	 * can happen when we crashed after registering this entry but
+	 * before actually preparing on the foreign server. So we assume
+	 * it to be aborted.
+	 */
+	else if (!TransactionIdIsInProgress(fdw_xact->local_xid))
+		is_commit = false;
+
+	/*
+	 * The local transaction is in progress and the foreign transaction
+	 * state is neither committing nor aborting. This should not
+	 * happen because we cannot decide whether to commit or abort a
+	 * foreign transaction associated with an in-progress local
+	 * transaction.
+	 */
+	else
+		ereport(ERROR,
+				(errmsg("cannot resolve foreign transaction associated with in-progress transaction %u on server %u",
+						fdw_xact->local_xid, fdw_xact->serverid)));
+
+	resolved = fdw_xact_handler(fdw_xact->serverid, fdw_xact->userid,
+								fdw_xact->umid, is_commit,
+								fdw_xact->fdw_xact_id);
+
+	/* If we succeeded in resolving the transaction, remove the entry */
+	if (resolved)
+		remove_fdw_xact(fdw_xact);
+
+	return resolved;
+}
+
+/*
+ * Get foreign transaction entry from FdwXactCtl->fdw_xacts. Return NULL
+ * if foreign transaction does not exist.
+ */
+static FdwXact
+get_fdw_xact(TransactionId xid, Oid serverid, Oid userid)
+{
+	int i;
+	FdwXact fdw_xact;
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	for (i = 0; i < FdwXactCtl->numFdwXacts; i++)
+	{
+		fdw_xact = FdwXactCtl->fdw_xacts[i];
+
+		if (fdw_xact->local_xid == xid &&
+			fdw_xact->serverid == serverid &&
+			fdw_xact->userid == userid)
+		{
+			LWLockRelease(FdwXactLock);
+			return fdw_xact;
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+	return NULL;
+}
+
+/*
+ * fdw_xact_exists
+ * Returns true if there exists at least one prepared foreign transaction which
+ * matches the criteria. This function is a wrapper around search_fdw_xact.
+ * Check that function's prologue for details.
+ */
+bool
+fdw_xact_exists(TransactionId xid, Oid dbid, Oid serverid, Oid userid)
+{
+	return search_fdw_xact(xid, dbid, serverid, userid, NULL);
+}
+
+/*
+ * search_fdw_xact
+ * Return true if there exists at least one prepared foreign transaction
+ * entry with given criteria. The criteria is defined by arguments with
+ * valid values for respective datatypes.
+ *
+ * The table below explains the same
+ * xid	   | dbid	 | serverid | userid  | search for entry with
+ * invalid | invalid | invalid	| invalid | nothing
+ * invalid | invalid | invalid	| valid   | given userid
+ * invalid | invalid | valid	| invalid | given serverid
+ * invalid | invalid | valid	| valid   | given serverid and userid
+ * invalid | valid	 | invalid	| invalid | given dbid
+ * invalid | valid	 | invalid	| valid   | given dbid and userid
+ * invalid | valid	 | valid	| invalid | given dbid and serverid
+ * invalid | valid	 | valid	| valid   | given dbid, serverid and userid
+ * valid   | invalid | invalid	| invalid | given xid
+ * valid   | invalid | invalid	| valid   | given xid and userid
+ * valid   | invalid | valid	| invalid | given xid, serverid
+ * valid   | invalid | valid	| valid   | given xid, serverid, userid
+ * valid   | valid	 | invalid	| invalid | given xid and dbid
+ * valid   | valid	 | invalid	| valid   | given xid, dbid and userid
+ * valid   | valid	 | valid	| invalid | given xid, dbid, serverid
+ * valid   | valid	 | valid	| valid   | given xid, dbid, serverid, userid
+ *
+ * When the criteria are void (all arguments invalid), any existing entry
+ * matches, so the function returns true as long as at least one entry exists.
+ *
+ * If qualifying_fdw_xacts is not NULL, the qualifying entries are locked and
+ * returned in a linked list. Any entry which is already locked is ignored. If
+ * all the qualifying entries are locked, nothing will be returned in the list
+ * but returned value will be true.
+ */
+static bool
+search_fdw_xact(TransactionId xid, Oid dbid, Oid serverid, Oid userid,
+				List **qualifying_xacts)
+{
+	int			cnt;
+	LWLockMode	lock_mode;
+
+	/* Return value if a qualifying entry exists */
+	bool		entry_exists = false;
+
+	if (qualifying_xacts)
+	{
+		*qualifying_xacts = NIL;
+		/* The caller expects us to lock entries */
+		lock_mode = LW_EXCLUSIVE;
+	}
+	else
+		lock_mode = LW_SHARED;
+
+	LWLockAcquire(FdwXactLock, lock_mode);
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		FdwXact		fdw_xact = FdwXactCtl->fdw_xacts[cnt];
+		bool		entry_matches = true;
+
+		/* xid */
+		if (xid != InvalidTransactionId && xid != fdw_xact->local_xid)
+			entry_matches = false;
+
+		/* dbid */
+		if (OidIsValid(dbid) && fdw_xact->dboid != dbid)
+			entry_matches = false;
+
+		/* serverid */
+		if (OidIsValid(serverid) && serverid != fdw_xact->serverid)
+			entry_matches = false;
+
+		/* userid */
+		if (OidIsValid(userid) && fdw_xact->userid != userid)
+			entry_matches = false;
+
+		if (entry_matches)
+		{
+			entry_exists = true;
+			if (qualifying_xacts)
+			{
+				/*
+				 * User has requested list of qualifying entries. If the
+				 * matching entry is not locked, lock it and add to the list.
+				 * If the entry is locked by some other backend, ignore it.
+				 */
+				if (fdw_xact->locking_backend == InvalidBackendId)
+				{
+					MemoryContext oldcontext;
+
+					fdw_xact->locking_backend = MyBackendId;
+
+					/*
+					 * The list and its members may be required at the end of
+					 * the transaction
+					 */
+					oldcontext = MemoryContextSwitchTo(TopTransactionContext);
+					MyLockedFdwXacts = lappend(MyLockedFdwXacts, fdw_xact);
+					MemoryContextSwitchTo(oldcontext);
+				}
+				else if (fdw_xact->locking_backend != MyBackendId)
+					continue;
+
+				*qualifying_xacts = lappend(*qualifying_xacts, fdw_xact);
+			}
+			else
+			{
+				/*
+				 * User wants to check the existence, and we have found one
+				 * matching entry. No need to check other entries.
+				 */
+				break;
+			}
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	return entry_exists;
+}
+
+/*
+ * fdw_xact_redo
+ * Apply the redo log for a foreign transaction.
+ */
+void
+fdw_xact_redo(XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_FDW_XACT_INSERT)
+		FdwXactRedoAdd(record);
+	else if (info == XLOG_FDW_XACT_REMOVE)
+	{
+		FdwRemoveXlogRec *fdw_remove_xlog = (FdwRemoveXlogRec *) rec;
+
+		/* Delete FdwXact entry and file if exists */
+		FdwXactRedoRemove(fdw_remove_xlog->xid, fdw_remove_xlog->serverid,
+						  fdw_remove_xlog->userid);
+	}
+	else
+		elog(ERROR, "invalid log type %d in foreign transaction log record", info);
+
+	return;
+}
+
+/*
+ * CheckPointFdwXacts
+ *
+ * Function syncs the foreign transaction files created between the two
+ * checkpoints. The foreign transaction entries and hence the corresponding
+ * files are expected to be very short-lived. By executing this function at the
+ * end, we might have fewer files to fsync, thus reducing some I/O. This is
+ * similar to CheckPointTwoPhase().
+ *
+ * In order to avoid disk I/O while holding a light weight lock, the function
+ * first collects the files which need to be synced under FdwXactLock and then
+ * syncs them after releasing the lock. This approach creates a race condition:
+ * after releasing the lock, and before syncing a file, the corresponding
+ * foreign transaction entry and hence the file might get removed. The function
+ * checks whether that's true and ignores the error if so.
+ */
+void
+CheckPointFdwXacts(XLogRecPtr redo_horizon)
+{
+	int			cnt;
+	int			serialized_fdw_xacts = 0;
+
+	/* Quick get-away, before taking lock */
+	if (max_prepared_foreign_xacts <= 0)
+		return;
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_START();
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	/* Another quick check, before we allocate memory */
+	if (FdwXactCtl->numFdwXacts <= 0)
+	{
+		LWLockRelease(FdwXactLock);
+		return;
+	}
+
+	/*
+	 * We are expecting there to be zero FdwXacts that need to be copied to
+	 * disk, so we perform all I/O while holding FdwXactLock for simplicity.
+	 * This prevents any new foreign xacts from preparing while this occurs,
+	 * which shouldn't be a problem since the presence of long-lived prepared
+	 * foreign xacts indicates the transaction manager isn't active.
+	 *
+	 * It's also possible to move I/O out of the lock, but on every error we
+	 * should check whether somebody committed our transaction in a different
+	 * backend. Let's leave this optimisation for the future, if somebody
+	 * finds that this place causes a bottleneck.
+	 *
+	 * Note that it isn't possible for there to be a FdwXact with a
+	 * fdw_xact_end_lsn set prior to the last checkpoint yet is marked
+	 * invalid, because of the efforts with delayChkpt.
+	 */
+	for (cnt = 0; cnt < FdwXactCtl->numFdwXacts; cnt++)
+	{
+		FdwXact		fdw_xact = FdwXactCtl->fdw_xacts[cnt];
+
+		if ((fdw_xact->valid || fdw_xact->inredo) &&
+			!fdw_xact->ondisk &&
+			fdw_xact->fdw_xact_end_lsn <= redo_horizon)
+		{
+			char	   *buf;
+			int			len;
+
+			XlogReadFdwXactData(fdw_xact->fdw_xact_start_lsn, &buf, &len);
+			RecreateFdwXactFile(fdw_xact->local_xid, fdw_xact->serverid,
+								fdw_xact->userid, buf, len);
+			fdw_xact->ondisk = true;
+			serialized_fdw_xacts++;
+			pfree(buf);
+		}
+	}
+
+	LWLockRelease(FdwXactLock);
+
+	TRACE_POSTGRESQL_FDWXACT_CHECKPOINT_DONE();
+
+	if (log_checkpoints && serialized_fdw_xacts > 0)
+		ereport(LOG,
+			  (errmsg_plural("%u foreign transaction state file was written "
+							 "for long-running prepared transactions",
+							 "%u foreign transaction state files were written "
+							 "for long-running prepared transactions",
+							 serialized_fdw_xacts,
+							 serialized_fdw_xacts)));
+}
+
+/*
+ * Reads foreign transaction data from xlog. During checkpoint this data will
+ * be moved to fdwxact files and ReadFdwXactFile should be used instead.
+ *
+ * Note clearly that this function accesses WAL during normal operation, similarly
+ * to the way WALSender or Logical Decoding would do. It does not run during
+ * crash recovery or standby processing.
+ */
+static void
+XlogReadFdwXactData(XLogRecPtr lsn, char **buf, int *len)
+{
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, &read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+		   errdetail("Failed while allocating an XLog reading processor.")));
+
+	record = XLogReadRecord(xlogreader, lsn, &errormsg);
+
+	if (record == NULL)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not read foreign transaction state from xlog at %X/%X",
+			   (uint32) (lsn >> 32),
+			   (uint32) lsn)));
+
+	if (XLogRecGetRmid(xlogreader) != RM_FDW_XACT_ID ||
+		(XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK) != XLOG_FDW_XACT_INSERT)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("expected foreign transaction state data is not present in xlog at %X/%X",
+						(uint32) (lsn >> 32),
+						(uint32) lsn)));
+
+	if (len != NULL)
+		*len = XLogRecGetDataLen(xlogreader);
+
+	*buf = palloc(sizeof(char) * XLogRecGetDataLen(xlogreader));
+	memcpy(*buf, XLogRecGetData(xlogreader), sizeof(char) * XLogRecGetDataLen(xlogreader));
+
+	XLogReaderFree(xlogreader);
+}
+
+/*
+ * Recreates a foreign transaction state file. This is used in WAL replay and
+ * during checkpoint creation.
+ *
+ * Note: content and len don't include CRC.
+ */
+void
+RecreateFdwXactFile(TransactionId xid, Oid serverid, Oid userid,
+					void *content, int len)
+{
+	char		path[MAXPGPATH];
+	pg_crc32c	fdw_xact_crc;
+	pg_crc32c	bogus_crc;
+	int			fd;
+
+	/* Recompute CRC */
+	INIT_CRC32C(fdw_xact_crc);
+	COMP_CRC32C(fdw_xact_crc, content, len);
+
+	FdwXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_CREAT | O_TRUNC | O_WRONLY | PG_BINARY);
+
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+		errmsg("could not recreate foreign transaction state file \"%s\": %m",
+			   path)));
+
+	if (write(fd, content, len) != len)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	FIN_CRC32C(fdw_xact_crc);
+
+	/*
+	 * Write a deliberately bogus CRC to the state file; this is just paranoia
+	 * to catch the case where four more bytes will run us out of disk space.
+	 */
+	bogus_crc = ~fdw_xact_crc;
+	if ((write(fd, &bogus_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+	/* Back up to prepare for rewriting the CRC */
+	if (lseek(fd, -((off_t) sizeof(pg_crc32c)), SEEK_CUR) < 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			errmsg("could not seek in foreign transaction state file: %m")));
+	}
+
+	/* write correct CRC and close file */
+	if ((write(fd, &fdw_xact_crc, sizeof(pg_crc32c))) != sizeof(pg_crc32c))
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not write foreign transaction state file: %m")));
+	}
+
+	/*
+	 * We must fsync the file because the end-of-replay checkpoint will not do
+	 * so, there being no FdwXact entry in shared memory yet to tell it to.
+	 */
+	if (pg_fsync(fd) != 0)
+	{
+		CloseTransientFile(fd);
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			  errmsg("could not fsync foreign transaction state file: %m")));
+	}
+
+	if (CloseTransientFile(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not close foreign transaction file: %m")));
+}
+
+/* Built in functions */
+/*
+ * Structure to hold and iterate over the foreign transactions to be displayed
+ * by the built-in functions.
+ */
+typedef struct
+{
+	FdwXact		fdw_xacts;
+	int			num_xacts;
+	int			cur_xact;
+}	WorkingStatus;
+
+Datum
+pg_prepared_fdw_xacts(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	WorkingStatus *status;
+	char	   *xact_status;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext oldcontext;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * Switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* build tupdesc for result tuples */
+		/* this had better match pg_fdw_xacts view in system_views.sql */
+		tupdesc = CreateTemplateTupleDesc(6, false);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 1, "dbid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 2, "transaction",
+						   XIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 3, "serverid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 4, "userid",
+						   OIDOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 5, "status",
+						   TEXTOID, -1, 0);
+		TupleDescInitEntry(tupdesc, (AttrNumber) 6, "identifier",
+						   TEXTOID, -1, 0);
+
+		funcctx->tuple_desc = BlessTupleDesc(tupdesc);
+
+		/*
+		 * Collect status information that we will format and send out as a
+		 * result set.
+		 */
+		status = (WorkingStatus *) palloc(sizeof(WorkingStatus));
+		funcctx->user_fctx = (void *) status;
+
+		status->num_xacts = GetFdwXactList(&status->fdw_xacts);
+		status->cur_xact = 0;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+	status = funcctx->user_fctx;
+
+	while (status->cur_xact < status->num_xacts)
+	{
+		FdwXact		fdw_xact = &status->fdw_xacts[status->cur_xact++];
+		Datum		values[6];
+		bool		nulls[6];
+		HeapTuple	tuple;
+		Datum		result;
+
+		if (!fdw_xact->valid)
+			continue;
+
+		/*
+		 * Form tuple with appropriate data.
+		 */
+		MemSet(values, 0, sizeof(values));
+		MemSet(nulls, 0, sizeof(nulls));
+
+		values[0] = ObjectIdGetDatum(fdw_xact->dboid);
+		values[1] = TransactionIdGetDatum(fdw_xact->local_xid);
+		values[2] = ObjectIdGetDatum(fdw_xact->serverid);
+		values[3] = ObjectIdGetDatum(fdw_xact->userid);
+		switch (fdw_xact->status)
+		{
+			case FDW_XACT_PREPARING:
+				xact_status = "prepared";
+				break;
+			case FDW_XACT_COMMITTING_PREPARED:
+				xact_status = "committing";
+				break;
+			case FDW_XACT_ABORTING_PREPARED:
+				xact_status = "aborting";
+				break;
+			default:
+				xact_status = "unknown";
+				break;
+		}
+		values[4] = CStringGetTextDatum(xact_status);
+		/* XXX: should this really be interpreted by the FDW? */
+		values[5] = PointerGetDatum(cstring_to_text_with_len(fdw_xact->fdw_xact_id,
+												 FDW_XACT_ID_LEN));
+
+		tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+
+	SRF_RETURN_DONE(funcctx);
+}
+
+/*
+ * Returns an array of all foreign prepared transactions for the user-level
+ * function pg_prepared_fdw_xacts.
+ *
+ * The returned array and all its elements are copies of internal data
+ * structures, to minimize the time we need to hold the FdwXactLock.
+ *
+ * WARNING -- we return even those transactions whose information is not
+ * completely filled yet. The caller should filter them out if it doesn't want
+ * them.
+ *
+ * The returned array is palloc'd.
+ */
+static int
+GetFdwXactList(FdwXact * fdw_xacts)
+{
+	int			num_xacts;
+	int			cnt_xacts;
+
+	LWLockAcquire(FdwXactLock, LW_SHARED);
+
+	if (FdwXactCtl->numFdwXacts == 0)
+	{
+		LWLockRelease(FdwXactLock);
+		*fdw_xacts = NULL;
+		return 0;
+	}
+
+	num_xacts = FdwXactCtl->numFdwXacts;
+	*fdw_xacts = (FdwXact) palloc(sizeof(FdwXactData) * num_xacts);
+	for (cnt_xacts = 0; cnt_xacts < num_xacts; cnt_xacts++)
+		memcpy((*fdw_xacts) + cnt_xacts, FdwXactCtl->fdw_xacts[cnt_xacts],
+			   sizeof(FdwXactData));
+
+	LWLockRelease(FdwXactLock);
+
+	return num_xacts;
+}
+
+/*
+ * Built-in function to resolve a prepared foreign transaction manually.
+ */
+Datum
+pg_resolve_fdw_xact(PG_FUNCTION_ARGS)
+{
+	TransactionId	local_xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	Oid				serverid = PG_GETARG_OID(1);
+	Oid				userid = PG_GETARG_OID(2);
+	FdwXact			fdwxact;
+	bool			ret;
+
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to resolve foreign transactions"))));
+
+	fdwxact = get_fdw_xact(local_xid, serverid, userid);
+
+	if (!fdwxact)
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 (errmsg("could not find foreign transaction entry"))));
+
+	ret = FdwXactResolveForeignTransaction(fdwxact,
+										   get_prepared_foreign_xact_resolver(fdwxact));
+
+	PG_RETURN_BOOL(ret);
+}
+
+/*
+ * Built-in function to remove a prepared foreign transaction entry without
+ * resolving it. The function gives a way to forget about such a prepared
+ * transaction when the foreign server where it is prepared is no longer
+ * available, or when the user which prepared the transaction needs to be
+ * dropped.
+ */
+Datum
+pg_remove_fdw_xact(PG_FUNCTION_ARGS)
+{
+	TransactionId	local_xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	Oid				serverid = PG_GETARG_OID(1);
+	Oid				userid = PG_GETARG_OID(2);
+	FdwXact			fdwxact;
+
+	if (!superuser())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+				 (errmsg("must be superuser to remove foreign transactions"))));
+
+	fdwxact = get_fdw_xact(local_xid, serverid, userid);
+
+	if (!fdwxact)
+		ereport(ERROR,
+				(errcode(ERRCODE_UNDEFINED_OBJECT),
+				 (errmsg("could not find foreign transaction entry"))));
+
+	remove_fdw_xact(fdwxact);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Code dealing with the on disk files used to store foreign transaction
+ * information.
+ */
+
+/*
+ * ReadFdwXactFile
+ * Read the foreign transaction state file and return the contents in a
+ * structure allocated in-memory. The structure can be later freed by the
+ * caller.
+ */
+static FdwXactOnDiskData *
+ReadFdwXactFile(TransactionId xid, Oid serverid, Oid userid)
+{
+	char		path[MAXPGPATH];
+	int			fd;
+	FdwXactOnDiskData *fdw_xact_file_data;
+	struct stat stat;
+	uint32		crc_offset;
+	pg_crc32c	calc_crc;
+	pg_crc32c	file_crc;
+	char	   *buf;
+
+	FdwXactFilePath(path, xid, serverid, userid);
+
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+			   errmsg("could not open FDW transaction state file \"%s\": %m",
+					  path)));
+
+	/*
+	 * Check file length.  We can determine a lower bound pretty easily. We
+	 * set an upper bound to avoid palloc() failure on a corrupt file, though
+	 * we can't guarantee that we won't get an out of memory error anyway,
+	 * even on a valid file.
+	 */
+	if (fstat(fd, &stat))
+	{
+		CloseTransientFile(fd);
+
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not stat FDW transaction state file \"%s\": %m",
+					  path)));
+		return NULL;
+	}
+
+	if (stat.st_size < offsetof(FdwXactOnDiskData, fdw_xact_id) ||
+		stat.st_size > MaxAllocSize)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("FDW transaction state file \"%s\" has an invalid size",
+						path)));
+		return NULL;
+	}
+
+	buf = (char *) palloc(stat.st_size);
+	fdw_xact_file_data = (FdwXactOnDiskData *) buf;
+	crc_offset = stat.st_size - sizeof(pg_crc32c);
+	/* Slurp the file */
+	if (read(fd, fdw_xact_file_data, stat.st_size) != stat.st_size)
+	{
+		CloseTransientFile(fd);
+		ereport(WARNING,
+				(errcode_for_file_access(),
+			   errmsg("could not read FDW transaction state file \"%s\": %m",
+					  path)));
+		pfree(fdw_xact_file_data);
+		return NULL;
+	}
+
+	CloseTransientFile(fd);
+
+	/*
+	 * Check the CRC.
+	 */
+	INIT_CRC32C(calc_crc);
+	COMP_CRC32C(calc_crc, buf, crc_offset);
+	FIN_CRC32C(calc_crc);
+
+	file_crc = *((pg_crc32c *) (buf + crc_offset));
+
+	if (!EQ_CRC32C(calc_crc, file_crc))
+	{
+		pfree(buf);
+		return NULL;
+	}
+
+	if (fdw_xact_file_data->serverid != serverid ||
+		fdw_xact_file_data->userid != userid ||
+		fdw_xact_file_data->local_xid != xid)
+	{
+		ereport(WARNING,
+			(errmsg("removing corrupt foreign transaction state file \"%s\"",
+					path)));
+		/* The file was already closed right after it was read */
+		pfree(buf);
+		return NULL;
+	}
+
+	return fdw_xact_file_data;
+}
+
+/*
+ * PrescanFdwXacts
+ *
+ * Read the foreign prepared transactions directory for oldest active
+ * transaction. The transactions corresponding to the xids in this directory
+ * are not necessarily active per se locally. But we still need those XIDs to
+ * be alive so that
+ * 1. we can determine whether they are committed or aborted
+ * 2. the file name contains xid which shouldn't get used again to avoid
+ *	  conflicting file names.
+ *
+ * The function accepts the oldest active xid determined by other functions
+ * (e.g. PrescanPreparedTransactions()). It then compares every xid it comes
+ * across while scanning foreign prepared transactions directory with the oldest
+ * active xid. It returns the oldest of those xids or oldest active xid
+ * whichever is older.
+ *
+ * If any foreign prepared transaction is part of a future transaction (PITR),
+ * the function removes the corresponding file as
+ * 1. We cannot know the status of the local transaction which prepared this
+ * foreign transaction
+ * 2. The foreign server or the user may not be available as per new timeline
+ *
+ * Anyway, the local transaction which prepared the foreign prepared transaction
+ * does not exist as per the new timeline, so it's better to forget the foreign
+ * prepared transaction as well.
+ */
+TransactionId
+PrescanFdwXacts(TransactionId oldestActiveXid)
+{
+	TransactionId nextXid = ShmemVariableCache->nextXid;
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			/*
+			 * Remove a foreign prepared transaction file corresponding to an
+			 * XID, which is too new.
+			 */
+			if (TransactionIdFollowsOrEquals(local_xid, nextXid))
+			{
+				ereport(WARNING,
+						(errmsg("removing future foreign prepared transaction file \"%s\"",
+								clde->d_name)));
+				RemoveFdwXactFile(local_xid, serverid, userid, true);
+				continue;
+			}
+
+			if (TransactionIdPrecedesOrEquals(local_xid, oldestActiveXid))
+				oldestActiveXid = local_xid;
+		}
+	}
+
+	FreeDir(cldir);
+	return oldestActiveXid;
+}
+
+/*
+ * RecoverFdwXacts
+ * Read the foreign prepared transaction information and set it up for further
+ * usage.
+ */
+void
+RecoverFdwXacts(void)
+{
+	DIR		   *cldir;
+	struct dirent *clde;
+
+	cldir = AllocateDir(FDW_XACTS_DIR);
+	while ((clde = ReadDir(cldir, FDW_XACTS_DIR)) != NULL)
+	{
+		if (strlen(clde->d_name) == FDW_XACT_FILE_NAME_LEN &&
+		 strspn(clde->d_name, "0123456789ABCDEF_") == FDW_XACT_FILE_NAME_LEN)
+		{
+			Oid			serverid;
+			Oid			userid;
+			TransactionId local_xid;
+			FdwXactOnDiskData *fdw_xact_file_data;
+			FdwXact		fdw_xact;
+
+			sscanf(clde->d_name, "%08x_%08x_%08x", &local_xid, &serverid,
+				   &userid);
+
+			fdw_xact_file_data = ReadFdwXactFile(local_xid, serverid, userid);
+
+			if (!fdw_xact_file_data)
+			{
+				ereport(WARNING,
+				  (errmsg("removing corrupt foreign transaction file \"%s\"",
+						  clde->d_name)));
+				RemoveFdwXactFile(local_xid, serverid, userid, false);
+				continue;
+			}
+
+			ereport(LOG,
+					(errmsg("recovering foreign transaction entry for xid %u, foreign server %u and user %u",
+							local_xid, serverid, userid)));
+
+			fdw_xact = get_fdw_xact(local_xid, serverid, userid);
+
+			LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+			if (!fdw_xact)
+			{
+				/*
+				 * Add this entry into the table of foreign transactions. The
+				 * status of the transaction is set as preparing, since we do not
+				 * know the exact status right now. Resolver will set it later
+				 * based on the status of local transaction which prepared this
+				 * foreign transaction.
+				 */
+				fdw_xact = insert_fdw_xact(fdw_xact_file_data->dboid, local_xid,
+										   serverid, userid,
+										   fdw_xact_file_data->umid,
+										   fdw_xact_file_data->fdw_xact_id);
+				fdw_xact->locking_backend = MyBackendId;
+				fdw_xact->status = FDW_XACT_PREPARING;
+			}
+			else
+			{
+				Assert(fdw_xact->inredo);
+				fdw_xact->inredo = false;
+			}
+
+			/* Mark the entry as ready */
+			fdw_xact->valid = true;
+			/* Already synced to disk */
+			fdw_xact->ondisk = true;
+			pfree(fdw_xact_file_data);
+			LWLockRelease(FdwXactLock);
+		}
+	}
+
+	FreeDir(cldir);
+}
+
+/*
+ * Remove the foreign transaction file for given entry.
+ *
+ * If giveWarning is false, do not complain about file-not-present;
+ * this is an expected case during WAL replay.
+ */
+static void
+RemoveFdwXactFile(TransactionId xid, Oid serverid, Oid userid, bool giveWarning)
+{
+	char		path[MAXPGPATH];
+
+	FdwXactFilePath(path, xid, serverid, userid);
+	if (unlink(path))
+		if (errno != ENOENT || giveWarning)
+			ereport(WARNING,
+					(errcode_for_file_access(),
+					 errmsg("could not remove foreign transaction state file \"%s\": %m",
+							path)));
+}
+
+/*
+ * FdwXactRedoAdd
+ *
+ * Store pointer to the start/end of the WAL record along with the xid in
+ * a fdw_xact entry in shared memory FdwXactData structure.
+ */
+void
+FdwXactRedoAdd(XLogReaderState *record)
+{
+	FdwXactOnDiskData *fdw_xact_data = (FdwXactOnDiskData *) XLogRecGetData(record);
+	FdwXact fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	LWLockAcquire(FdwXactLock, LW_EXCLUSIVE);
+	fdw_xact = insert_fdw_xact(fdw_xact_data->dboid, fdw_xact_data->local_xid,
+							   fdw_xact_data->serverid, fdw_xact_data->userid,
+							   fdw_xact_data->umid, fdw_xact_data->fdw_xact_id);
+	fdw_xact->status = FDW_XACT_PREPARING;
+	fdw_xact->fdw_xact_start_lsn = record->ReadRecPtr;
+	fdw_xact->fdw_xact_end_lsn = record->EndRecPtr;
+	fdw_xact->inredo = true;
+	fdw_xact->valid = true;
+	LWLockRelease(FdwXactLock);
+}
+/*
+ * FdwXactRedoRemove
+ *
+ * Remove the corresponding fdw_xact entry from FdwXactCtl.
+ * Also remove fdw_xact file if a foreign transaction was saved
+ * via an earlier checkpoint.
+ */
+void
+FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+	FdwXact	fdw_xact;
+
+	Assert(RecoveryInProgress());
+
+	fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+	if (fdw_xact)
+	{
+		/* Now we can clean up any files we already left */
+		Assert(fdw_xact->inredo);
+		remove_fdw_xact(fdw_xact);
+	}
+	else
+	{
+		/*
+		 * Entry could be on disk. Call with giveWarning = false
+		 * since it can be expected during replay.
+		 */
+		RemoveFdwXactFile(xid, serverid, userid, false);
+	}
+}
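
Reviewer's sketch (not part of the patch): PrescanFdwXacts() and RecoverFdwXacts() above both recognize state files by name, using a length check, a strspn() character-set check and then sscanf(). The standalone helper below illustrates that validation under two assumptions that are not visible in this hunk: that FDW_XACT_FILE_NAME_LEN is 26 (three 8-digit hex fields joined by two underscores), and that a stricter caller would also want to check the sscanf() return value, which the patch does not do. The helper name and the constant are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumed value: three 8-digit hex fields joined by two underscores. */
#define SKETCH_FILE_NAME_LEN 26

/*
 * Hypothetical helper, not part of the patch: returns true and fills the
 * three output fields if "name" looks like XXXXXXXX_XXXXXXXX_XXXXXXXX.
 */
static bool
sketch_parse_fdw_xact_file_name(const char *name, unsigned int *xid,
								unsigned int *serverid, unsigned int *userid)
{
	assert(name != NULL);

	/* Same two checks the patch applies to each directory entry. */
	if (strlen(name) != SKETCH_FILE_NAME_LEN)
		return false;
	if (strspn(name, "0123456789ABCDEF_") != SKETCH_FILE_NAME_LEN)
		return false;

	/* Unlike the patch, also insist that all three fields were converted. */
	return sscanf(name, "%08x_%08x_%08x", xid, serverid, userid) == 3;
}
```

A name made only of underscores passes the length and character-set checks but fails the conversion count, which is the case the extra return-value check is meant to catch.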
diff --git a/src/backend/access/transam/fdwxact_resolver.c b/src/backend/access/transam/fdwxact_resolver.c
new file mode 100644
index 0000000..67f7d44
--- /dev/null
+++ b/src/backend/access/transam/fdwxact_resolver.c
@@ -0,0 +1,518 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.c
+ *
+ * PostgreSQL foreign transaction resolver background worker
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/transam/fdwxact_resolver.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include <signal.h>
+#include <unistd.h>
+
+#include "access/xact.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
+#include "access/resolver_private.h"
+#include "access/transam.h"
+
+#include "funcapi.h"
+#include "libpq/libpq.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "storage/ipc.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+/* GUC parameters */
+int foreign_xact_resolution_interval;
+int foreign_xact_resolver_timeout = 60 * 1000;
+
+FdwXactRslvCtlData *FdwXactRslvCtl;
+
+static void FdwXactRslvLoop(void);
+static long FdwXactRslvComputeSleepTime(TimestampTz now);
+static void FdwXactRslvCheckTimeout(TimestampTz now);
+
+static void fdwxact_resolver_sighup(SIGNAL_ARGS);
+static void fdwxact_resolver_onexit(int code, Datum arg);
+
+/* Flags set by signal handlers */
+static volatile sig_atomic_t got_SIGHUP = false;
+
+/* Report shared memory space needed by FdwXactRslvShmemInit */
+Size
+FdwXactRslvShmemSize(void)
+{
+	Size		size = offsetof(FdwXactRslvCtlData, resolvers);
+
+	size = add_size(size, mul_size(max_foreign_xact_resolvers,
+								   sizeof(FdwXactResolver)));
+
+	return size;
+}
+
+/*
+ * Allocate and initialize foreign transaction resolver shared
+ * memory.
+ */
+void
+FdwXactRslvShmemInit(void)
+{
+	bool found;
+
+	FdwXactRslvCtl = ShmemInitStruct("Foreign transactions resolvers",
+									 FdwXactRslvShmemSize(),
+									 &found);
+
+	if (!IsUnderPostmaster)
+	{
+		int	slot;
+
+		/* First time through, so initialize */
+		MemSet(FdwXactRslvCtl, 0, FdwXactRslvShmemSize());
+
+		SHMQueueInit(&(FdwXactRslvCtl->FdwXactQueue));
+
+		for (slot = 0; slot < max_foreign_xact_resolvers; slot++)
+		{
+			FdwXactResolver *resolver = &FdwXactRslvCtl->resolvers[slot];
+
+			SpinLockInit(&(resolver->mutex));
+		}
+	}
+}
+
+/*
+ * Clean up foreign transaction resolver info.
+ */
+static void
+fdwxact_resolver_onexit(int code, Datum arg)
+{
+	MyFdwXactResolver->pid = InvalidPid;
+	MyFdwXactResolver->in_use = false;
+}
+
+/*
+ * Attach to a slot.
+ */
+void
+fdwxact_resolver_attach(int slot)
+{
+	Assert(slot >= 0 && slot < max_foreign_xact_resolvers);
+
+	/* Block concurrent access */
+	LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+	MyFdwXactResolver = &FdwXactRslvCtl->resolvers[slot];
+
+	if (!MyFdwXactResolver->in_use)
+	{
+		LWLockRelease(FdwXactResolverLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("foreign transaction resolver slot %d is empty, cannot attach",
+						slot)));
+	}
+
+	SpinLockAcquire(&(MyFdwXactResolver->mutex));
+	MyFdwXactResolver->pid = MyProcPid;
+	SpinLockRelease(&(MyFdwXactResolver->mutex));
+
+	MyFdwXactResolver->latch = &MyProc->procLatch;
+
+	before_shmem_exit(fdwxact_resolver_onexit, (Datum) 0);
+
+	LWLockRelease(FdwXactResolverLock);
+}
+
+/* Set flag to reload configuration at next convenient time */
+static void
+fdwxact_resolver_sighup(SIGNAL_ARGS)
+{
+	int		save_errno = errno;
+
+	got_SIGHUP = true;
+
+	SetLatch(MyLatch);
+
+	errno = save_errno;
+}
+
+/* Foreign transaction resolver entry point */
+void
+FdwXactRslvMain(Datum main_arg)
+{
+	int slot = DatumGetInt32(main_arg);
+
+	/* Attach to a slot */
+	fdwxact_resolver_attach(slot);
+
+	elog(DEBUG1, "foreign transaction resolver for database %u started",
+		 MyFdwXactResolver->dbid);
+
+	/* Establish signal handlers */
+	pqsignal(SIGHUP, fdwxact_resolver_sighup);
+	pqsignal(SIGTERM, die);
+	BackgroundWorkerUnblockSignals();
+
+	/* Initialize stats to a sanish value */
+	MyFdwXactResolver->last_resolved_time = GetCurrentTimestamp();
+
+	/* Establish connection to nailed catalogs */
+	BackgroundWorkerInitializeConnectionByOid(MyFdwXactResolver->dbid, InvalidOid);
+
+	FdwXactRslvLoop();
+
+	proc_exit(0);
+}
+
+/*
+ * Fdwxact resolver main loop
+ */
+static void
+FdwXactRslvLoop(void)
+{
+	FdwXactResolveState *fstate;
+
+	/* Create an FdwXactResolveState */
+	fstate = CreateFdwXactResolveState();
+
+	for (;;)
+	{
+		int			rc;
+		TimestampTz	now;
+		long		sleep_time;
+		bool		resolved;
+
+		ResetLatch(MyLatch);
+
+		CHECK_FOR_INTERRUPTS();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+		}
+
+		/* Resolve a distributed transaction */
+		StartTransactionCommand();
+		resolved = FdwXactResolveDistributedTransaction(fstate);
+		CommitTransactionCommand();
+
+		if (resolved)
+		{
+			/* Update my state */
+			SpinLockAcquire(&(MyFdwXactResolver->mutex));
+			MyFdwXactResolver->last_resolved_time = GetCurrentTimestamp();
+			SpinLockRelease(&(MyFdwXactResolver->mutex));
+		}
+
+		now = GetCurrentTimestamp();
+
+		/* Check for fdwxact resolver timeout */
+		FdwXactRslvCheckTimeout(now);
+
+		/*
+		 * We don't block if we resolved any distributed transaction because
+		 * there might be other distributed transactions waiting to be resolved
+		 */
+		if (!resolved)
+		{
+			FdwXactResolveDanglingTransactions(MyDatabaseId);
+
+			sleep_time = FdwXactRslvComputeSleepTime(now);
+
+			rc = WaitLatch(MyLatch,
+						   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+						   sleep_time,
+						   WAIT_EVENT_FDW_XACT_RESOLVER_MAIN);
+
+			if (rc & WL_POSTMASTER_DEATH)
+				proc_exit(1);
+		}
+	}
+}
+
+/*
+ * Check whether any foreign transaction has been resolved within
+ * foreign_xact_resolver_timeout, and shut down if not.
+ */
+static void
+FdwXactRslvCheckTimeout(TimestampTz now)
+{
+	if (foreign_xact_resolver_timeout > 0)
+	{
+		TimestampTz timeout;
+
+		timeout = TimestampTzPlusMilliseconds(MyFdwXactResolver->last_resolved_time,
+											  foreign_xact_resolver_timeout);
+
+		/*
+		 * We have reached the timeout. We can exit if there are no
+		 * pending foreign transactions in the shmem queue. Check for that
+		 * and then shut down, clearing in_use under the slot's mutex.
+		 */
+		if (now >= timeout)
+		{
+			if (!fdw_xact_exists(InvalidTransactionId, MyDatabaseId, InvalidOid,
+								 InvalidOid))
+			{
+				/*
+				 * There are no more transactions we need to resolve,
+				 * turn off my slot while holding lock so that concurrent
+				 * backends cannot register additional entries.
+				 */
+				SpinLockAcquire(&(MyFdwXactResolver->mutex));
+				MyFdwXactResolver->in_use = false;
+				SpinLockRelease(&(MyFdwXactResolver->mutex));
+
+				ereport(LOG,
+						(errmsg("foreign transaction resolver for database \"%u\" will stop because of the timeout",
+								MyFdwXactResolver->dbid)));
+
+				proc_exit(0);
+			}
+		}
+	}
+}
+
+/*
+ * Compute how long we should sleep until the next cycle. Return the sleep
+ * time in milliseconds.
+ */
+static long
+FdwXactRslvComputeSleepTime(TimestampTz now)
+{
+	static TimestampTz	wakeuptime = 0;
+	long	sleeptime;
+	long	sec_to_timeout;
+	int		microsec_to_timeout;
+
+	if (now >= wakeuptime)
+		wakeuptime = TimestampTzPlusMilliseconds(now,
+												 foreign_xact_resolution_interval * 1000);
+
+	/* Compute relative time until wakeup. */
+	TimestampDifference(now, wakeuptime,
+						&sec_to_timeout, &microsec_to_timeout);
+
+	sleeptime = sec_to_timeout * 1000 + microsec_to_timeout / 1000;
+
+	return sleeptime;
+}
+
+/*
+ * Launch a new foreign transaction resolver worker if not launched yet.
+ * A foreign transaction resolver worker is responsible for resolution of
+ * foreign transactions that are registered on a database. So if a resolver
+ * worker is already launched, we don't need to launch a new one.
+ */
+void
+fdwxact_maybe_launch_resolver(void)
+{
+	FdwXactResolver *resolver = NULL;
+	BackgroundWorker bgw;
+	BackgroundWorkerHandle *bgw_handle;
+	int i;
+	int	slot;
+	bool	found = false;
+
+	LWLockAcquire(FdwXactResolverLock, LW_EXCLUSIVE);
+
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver *r = &FdwXactRslvCtl->resolvers[i];
+
+		/*
+		 * Found a running resolver that is responsible for the
+		 * database "dbid".
+		 */
+		if (r->in_use && r->pid != InvalidPid && r->dbid == MyDatabaseId)
+		{
+			found = true;
+			resolver = r;
+			break;
+		}
+	}
+
+	/*
+	 * If we found the resolver for my database, we don't need to launch a new
+	 * one. Just wake it up.
+	 */
+	if (found)
+	{
+		SetLatch(resolver->latch);
+		LWLockRelease(FdwXactResolverLock);
+
+		elog(DEBUG1, "found a running foreign transaction resolver process for database %u",
+			 MyDatabaseId);
+
+		return;
+	}
+
+	elog(DEBUG1, "starting foreign transaction resolver for database ID %u", MyDatabaseId);
+
+	/* Find an unused worker slot */
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver *r = &FdwXactRslvCtl->resolvers[i];
+
+		/* Found an unused worker slot */
+		if (!r->in_use)
+		{
+			resolver = r;
+			slot = i;
+			break;
+		}
+	}
+
+	/*
+	 * However if there are no more free worker slots, inform user about it before
+	 * exiting.
+	 */
+	if (resolver == NULL)
+	{
+		LWLockRelease(FdwXactResolverLock);
+		ereport(ERROR,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of foreign transaction resolver slots"),
+				 errhint("You might need to increase max_foreign_transaction_resolvers.")));
+
+		return;
+	}
+
+	/* Prepare the resolver slot. It's in use but pid is still invalid */
+	resolver->dbid = MyDatabaseId;
+	resolver->in_use = true;
+	resolver->pid = InvalidPid;
+	TIMESTAMP_NOBEGIN(resolver->last_resolved_time);
+
+	LWLockRelease(FdwXactResolverLock);
+
+	/* Register the new dynamic worker */
+	memset(&bgw, 0, sizeof(bgw));
+	bgw.bgw_flags = BGWORKER_SHMEM_ACCESS |
+		BGWORKER_BACKEND_DATABASE_CONNECTION;
+	bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;
+	snprintf(bgw.bgw_library_name, BGW_MAXLEN, "postgres");
+	snprintf(bgw.bgw_function_name, BGW_MAXLEN, "FdwXactRslvMain");
+	snprintf(bgw.bgw_name, BGW_MAXLEN,
+			 "foreign transaction resolver for database %u", MyDatabaseId);
+	snprintf(bgw.bgw_type, BGW_MAXLEN, "foreign transaction resolver");
+	bgw.bgw_restart_time = BGW_NEVER_RESTART;
+	bgw.bgw_main_arg = Int32GetDatum(slot);	/* read back in FdwXactRslvMain */
+	bgw.bgw_notify_pid = 0;
+
+	if (!RegisterDynamicBackgroundWorker(&bgw, &bgw_handle))
+	{
+		/* Failed to launch, cleanup the worker slot */
+		SpinLockAcquire(&(resolver->mutex));
+		resolver->in_use = false;
+		SpinLockRelease(&(resolver->mutex));
+
+		ereport(WARNING,
+				(errcode(ERRCODE_CONFIGURATION_LIMIT_EXCEEDED),
+				 errmsg("out of background worker slots"),
+				 errhint("You might need to increase max_worker_processes.")));
+	}
+
+	/*
+	 * We don't need to wait until it attaches here because we're going to wait
+	 * until all foreign transactions are resolved.
+	 */
+}
+
+/*
+ * Returns activity of foreign transaction resolvers, including pids, database
+ * OIDs and the last resolution time.
+ */
+Datum
+pg_stat_get_fdwxact_resolver(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_FDWXACT_RESOLVERS_COLS 3
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	int i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	for (i = 0; i < max_foreign_xact_resolvers; i++)
+	{
+		FdwXactResolver	*resolver = &FdwXactRslvCtl->resolvers[i];
+		pid_t	pid;
+		Oid		dbid;
+		TimestampTz last_resolved_time;
+		Datum		values[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+		bool		nulls[PG_STAT_GET_FDWXACT_RESOLVERS_COLS];
+
+
+		SpinLockAcquire(&(resolver->mutex));
+		if (resolver->pid == 0)
+		{
+			SpinLockRelease(&(resolver->mutex));
+			continue;
+		}
+
+		pid = resolver->pid;
+		dbid = resolver->dbid;
+		last_resolved_time = resolver->last_resolved_time;
+		SpinLockRelease(&(resolver->mutex));
+
+		memset(nulls, 0, sizeof(nulls));
+		/* pid */
+		values[0] = Int32GetDatum(pid);
+
+		/* dbid */
+		values[1] = ObjectIdGetDatum(dbid);
+
+		/* last_resolved_time */
+		if (last_resolved_time == 0)
+			nulls[2] = true;
+		else
+			values[2] = TimestampTzGetDatum(last_resolved_time);
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	/* clean up and return the tuplestore */
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b360b1 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -9,6 +9,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
 #include "access/generic_xlog.h"
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index c479c48..90b4691 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -77,6 +77,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "access/subtrans.h"
 #include "access/transam.h"
@@ -845,6 +846,35 @@ TwoPhaseGetGXact(TransactionId xid)
 }
 
 /*
+ * TwoPhaseExists
+ *		Return true if there is a prepared transaction specified by XID
+ */
+bool
+TwoPhaseExists(TransactionId xid)
+{
+	int		i;
+	bool	found = false;
+
+	LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
+		PGXACT	*pgxact = &ProcGlobal->allPgXact[gxact->pgprocno];
+
+		if (pgxact->xid == xid)
+		{
+			found = true;
+			break;
+		}
+	}
+
+	LWLockRelease(TwoPhaseStateLock);
+
+	return found;
+}
+
+/*
  * TwoPhaseGetDummyProc
  *		Get the dummy backend ID for prepared transaction specified by XID
  *
@@ -2240,6 +2270,12 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	 * in the procarray and continue to hold locks.
 	 */
 	SyncRepWaitForLSN(recptr, true);
+
+	/*
+	 * Wait for foreign transaction prepared as part of this prepared
+	 * transaction to be committed.
+	 */
+	FdwXactWaitToBeResolved(xid, true);
 }
 
 /*
@@ -2298,6 +2334,12 @@ RecordTransactionAbortPrepared(TransactionId xid,
 	 * in the procarray and continue to hold locks.
 	 */
 	SyncRepWaitForLSN(recptr, false);
+
+	/*
+	 * Wait for foreign transaction prepared as part of this prepared
+	 * transaction to be rolled back.
+	 */
+	FdwXactWaitToBeResolved(xid, false);
 }
 
 /*
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index ea81f4b..74eb83a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -21,6 +21,7 @@
 #include <unistd.h>
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/parallel.h"
 #include "access/subtrans.h"
@@ -1129,6 +1130,7 @@ RecordTransactionCommit(void)
 	SharedInvalidationMessage *invalMessages = NULL;
 	bool		RelcacheInitFileInval = false;
 	bool		wrote_xlog;
+	bool		need_twophase;
 
 	/* Get data needed for commit record */
 	nrels = smgrGetPendingDeletes(true, &rels);
@@ -1137,6 +1139,7 @@ RecordTransactionCommit(void)
 		nmsgs = xactGetCommittedInvalidationMessages(&invalMessages,
 													 &RelcacheInitFileInval);
 	wrote_xlog = (XactLastRecEnd != 0);
+	need_twophase = TwoPhaseCommitRequired();
 
 	/*
 	 * If we haven't been assigned an XID yet, we neither can, nor do we want
@@ -1175,12 +1178,13 @@ RecordTransactionCommit(void)
 		}
 
 		/*
-		 * If we didn't create XLOG entries, we're done here; otherwise we
-		 * should trigger flushing those entries the same as a commit record
+		 * If we didn't create XLOG entries and the transaction does not need
+		 * to be committed using two-phase commit, we're done here; otherwise
+		 * we should trigger flushing those entries the same as a commit record
 		 * would.  This will primarily happen for HOT pruning and the like; we
 		 * want these to be flushed to disk in due time.
 		 */
-		if (!wrote_xlog)
+		if (!wrote_xlog && !need_twophase)
 			goto cleanup;
 	}
 	else
@@ -1338,6 +1342,14 @@ RecordTransactionCommit(void)
 	if (wrote_xlog && markXidCommitted)
 		SyncRepWaitForLSN(XactLastRecEnd, true);
 
+	/*
+	 * Wait for prepared foreign transaction to be resolved, if required.
+	 * We only want to wait if we prepared foreign transaction in this
+	 * transaction.
+	 */
+	if (need_twophase && markXidCommitted)
+		FdwXactWaitToBeResolved(xid, true);
+
 	/* remember end of last commit record */
 	XactLastCommitEnd = XactLastRecEnd;
 
@@ -1975,6 +1987,9 @@ CommitTransaction(void)
 			break;
 	}
 
+	/* Pre-commit step for foreign transactions */
+	PreCommit_FdwXacts();
+
 	CallXactCallbacks(is_parallel_worker ? XACT_EVENT_PARALLEL_PRE_COMMIT
 					  : XACT_EVENT_PRE_COMMIT);
 
@@ -2130,6 +2145,7 @@ CommitTransaction(void)
 	AtEOXact_PgStat(true);
 	AtEOXact_Snapshot(true, false);
 	AtEOXact_ApplyLauncher(true);
+	AtEOXact_FdwXacts(true);
 	pgstat_report_xact_timestamp(0);
 
 	CurrentResourceOwner = NULL;
@@ -2217,6 +2233,8 @@ PrepareTransaction(void)
 	 * the transaction-abort path.
 	 */
 
+	AtPrepare_FdwXacts();
+
 	/* Shut down the deferred-trigger manager */
 	AfterTriggerEndXact(true);
 
@@ -2405,6 +2423,7 @@ PrepareTransaction(void)
 	AtEOXact_Files();
 	AtEOXact_ComboCid();
 	AtEOXact_HashTables(true);
+	AtEOXact_FdwXacts(true);
 	/* don't call AtEOXact_PgStat here; we fixed pgstat state above */
 	AtEOXact_Snapshot(true, true);
 	pgstat_report_xact_timestamp(0);
@@ -2609,6 +2628,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtEOXact_FdwXacts(false);
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 18b7471..69184d6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -24,6 +24,7 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/multixact.h"
 #include "access/rewriteheap.h"
 #include "access/subtrans.h"
@@ -5157,6 +5158,7 @@ BootStrapXLOG(void)
 	ControlFile->MaxConnections = MaxConnections;
 	ControlFile->max_worker_processes = max_worker_processes;
 	ControlFile->max_prepared_xacts = max_prepared_xacts;
+	ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 	ControlFile->max_locks_per_xact = max_locks_per_xact;
 	ControlFile->wal_level = wal_level;
 	ControlFile->wal_log_hints = wal_log_hints;
@@ -6244,6 +6246,9 @@ CheckRequiredParameterValues(void)
 		RecoveryRequiresIntParameter("max_prepared_transactions",
 									 max_prepared_xacts,
 									 ControlFile->max_prepared_xacts);
+		RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+									 max_prepared_foreign_xacts,
+									 ControlFile->max_prepared_foreign_xacts);
 		RecoveryRequiresIntParameter("max_locks_per_transaction",
 									 max_locks_per_xact,
 									 ControlFile->max_locks_per_xact);
@@ -6928,8 +6933,12 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
+
 			if (wasShutdown)
+			{
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+				oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
+			}
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
 			Assert(TransactionIdIsValid(oldestActiveXID));
@@ -7554,6 +7563,7 @@ StartupXLOG(void)
 
 	/* Pre-scan prepared transactions to find out the range of XIDs present */
 	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
+	oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7740,6 +7750,8 @@ StartupXLOG(void)
 	/* Reload shared-memory state for prepared transactions */
 	RecoverPreparedTransactions();
 
+	RecoverFdwXacts();
+
 	/*
 	 * Shutdown the recovery environment. This must occur after
 	 * RecoverPreparedTransactions(), see notes for lock_twophase_recover()
@@ -9045,6 +9057,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
+	CheckPointFdwXacts(checkPointRedo);
 }
 
 /*
@@ -9481,7 +9494,8 @@ XLogReportParameters(void)
 		max_worker_processes != ControlFile->max_worker_processes ||
 		max_prepared_xacts != ControlFile->max_prepared_xacts ||
 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
-		track_commit_timestamp != ControlFile->track_commit_timestamp)
+		track_commit_timestamp != ControlFile->track_commit_timestamp ||
+		max_prepared_foreign_xacts != ControlFile->max_prepared_foreign_xacts)
 	{
 		/*
 		 * The change in number of backend slots doesn't need to be WAL-logged
@@ -9513,6 +9527,7 @@ XLogReportParameters(void)
 		ControlFile->MaxConnections = MaxConnections;
 		ControlFile->max_worker_processes = max_worker_processes;
 		ControlFile->max_prepared_xacts = max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = max_locks_per_xact;
 		ControlFile->wal_level = wal_level;
 		ControlFile->wal_log_hints = wal_log_hints;
@@ -9710,6 +9725,7 @@ xlog_redo(XLogReaderState *record)
 			RunningTransactionsData running;
 
 			oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
+			oldestActiveXID = PrescanFdwXacts(oldestActiveXID);
 
 			/*
 			 * Construct a RunningTransactions snapshot representing a shut
@@ -9899,6 +9915,7 @@ xlog_redo(XLogReaderState *record)
 		ControlFile->MaxConnections = xlrec.MaxConnections;
 		ControlFile->max_worker_processes = xlrec.max_worker_processes;
 		ControlFile->max_prepared_xacts = xlrec.max_prepared_xacts;
+		ControlFile->max_prepared_foreign_xacts = xlrec.max_prepared_foreign_xacts;
 		ControlFile->max_locks_per_xact = xlrec.max_locks_per_xact;
 		ControlFile->wal_level = xlrec.wal_level;
 		ControlFile->wal_log_hints = xlrec.wal_log_hints;
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 5652e9e..6efd205 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -291,6 +291,9 @@ CREATE VIEW pg_prepared_xacts AS
 CREATE VIEW pg_prepared_statements AS
     SELECT * FROM pg_prepared_statement() AS P;
 
+CREATE VIEW pg_prepared_fdw_xacts AS
+       SELECT * FROM pg_prepared_fdw_xacts() AS F;
+
 CREATE VIEW pg_seclabels AS
 SELECT
 	l.objoid, l.classoid, l.objsubid,
@@ -769,6 +772,14 @@ CREATE VIEW pg_stat_subscription AS
             LEFT JOIN pg_stat_get_subscription(NULL) st
                       ON (st.subid = su.oid);
 
+CREATE VIEW pg_stat_fdwxact_resolvers AS
+    SELECT
+            r.pid,
+            r.dbid,
+            r.last_resolved_time
+    FROM pg_stat_get_fdwxact_resolver() r
+    WHERE r.pid IS NOT NULL;
+
 CREATE VIEW pg_stat_ssl AS
     SELECT
             S.pid,
diff --git a/src/backend/commands/foreigncmds.c b/src/backend/commands/foreigncmds.c
index 5c53aee..a0ea20b 100644
--- a/src/backend/commands/foreigncmds.c
+++ b/src/backend/commands/foreigncmds.c
@@ -13,6 +13,7 @@
  */
 #include "postgres.h"
 
+#include "access/fdwxact.h"
 #include "access/heapam.h"
 #include "access/htup_details.h"
 #include "access/reloptions.h"
@@ -1093,6 +1094,14 @@ RemoveForeignServerById(Oid srvId)
 	if (!HeapTupleIsValid(tp))
 		elog(ERROR, "cache lookup failed for foreign server %u", srvId);
 
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+	{
+		Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transactions on it",
+						NameStr(srvForm->srvname))));
+	}
+
 	CatalogTupleDelete(rel, &tp->t_self);
 
 	ReleaseSysCache(tp);
@@ -1403,6 +1412,17 @@ RemoveUserMapping(DropUserMappingStmt *stmt)
 	user_mapping_ddl_aclcheck(useId, srv->serverid, srv->servername);
 
 	/*
+	 * If there is a foreign prepared transaction with this user mapping,
+	 * dropping the user mapping might result in dangling prepared
+	 * transaction.
+	 */
+	if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srv->serverid,
+						useId))
+		ereport(ERROR,
+				(errmsg("server \"%s\" has unresolved prepared transaction for user \"%s\"",
+						srv->servername, MappingUserName(useId))));
+
+	/*
 	 * Do the deletion
 	 */
 	object.classId = UserMappingRelationId;
diff --git a/src/backend/postmaster/bgworker.c b/src/backend/postmaster/bgworker.c
index f651bb4..9488313 100644
--- a/src/backend/postmaster/bgworker.c
+++ b/src/backend/postmaster/bgworker.c
@@ -16,6 +16,7 @@
 
 #include "libpq/pqsignal.h"
 #include "access/parallel.h"
+#include "access/fdwxact_resolver.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -129,6 +130,9 @@ static const struct
 	},
 	{
 		"ApplyWorkerMain", ApplyWorkerMain
+	},
+	{
+		"FdwXactRslvMain", FdwXactRslvMain
 	}
 };
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 96ba216..558119a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3676,6 +3676,12 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_FDW_XACT_RESOLUTION:
+			event_name = "FdwXactResolution";
+			break;
+		case WAIT_EVENT_FDW_XACT_RESOLVER_MAIN:
+			event_name = "FdwXactResolver";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f3ddf82..cb3b288 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -95,6 +95,7 @@
 
 #include "access/transam.h"
 #include "access/xlog.h"
+#include "access/fdwxact_resolver.h"
 #include "bootstrap/bootstrap.h"
 #include "catalog/pg_control.h"
 #include "common/ip.h"
@@ -899,6 +900,10 @@ PostmasterMain(int argc, char *argv[])
 		ereport(ERROR,
 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"replica\" or \"logical\"")));
 
+	if (max_prepared_foreign_xacts > 0 && max_foreign_xact_resolvers == 0)
+		ereport(ERROR,
+				(errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires max_foreign_transaction_resolvers > 0")));
+
 	/*
 	 * Other one-time internal sanity checks can go here, if they are fast.
 	 * (Put any slow processing further down, after postmaster.pid creation.)
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 6eb0d55..4c0a7fc 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -153,6 +153,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_FDW_XACT_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a58..60155cc 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -16,6 +16,8 @@
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
+#include "access/fdwxact_resolver.h"
 #include "access/heapam.h"
 #include "access/multixact.h"
 #include "access/nbtree.h"
@@ -150,6 +152,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, SyncScanShmemSize());
 		size = add_size(size, AsyncShmemSize());
 		size = add_size(size, BackendRandomShmemSize());
+		size = add_size(size, FdwXactShmemSize());
+		size = add_size(size, FdwXactRslvShmemSize());
 #ifdef EXEC_BACKEND
 		size = add_size(size, ShmemBackendArraySize());
 #endif
@@ -270,6 +274,8 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	SyncScanShmemInit();
 	AsyncShmemInit();
 	BackendRandomShmemInit();
+	FdwXactShmemInit();
+	FdwXactRslvShmemInit();
 
 #ifdef EXEC_BACKEND
 
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..a42d06e 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,5 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+FdwXactLock					46
+FdwXactResolverLock			47
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6f30e08..dbac820 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -35,6 +35,7 @@
 #include <unistd.h>
 #include <sys/time.h>
 
+#include "access/fdwxact.h"
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
@@ -397,6 +398,10 @@ InitProcess(void)
 	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
 	SHMQueueElemInit(&(MyProc->syncRepLinks));
 
+	/* initialize fields for fdw xact */
+	MyProc->fdwXactState = FDW_XACT_NOT_WAITING;
+	SHMQueueElemInit(&(MyProc->fdwXactLinks));
+
 	/* Initialize fields for group XID clearing. */
 	MyProc->procArrayGroupMember = false;
 	MyProc->procArrayGroupMemberXid = InvalidTransactionId;
@@ -797,6 +802,9 @@ ProcKill(int code, Datum arg)
 	/* Make sure we're out of the sync rep lists */
 	SyncRepCleanupAtProcExit();
 
+	/* Make sure we're out of the fdwxact lists */
+	FdwXactCleanupAtProcExit();
+
 #ifdef USE_ASSERT_CHECKING
 	{
 		int			i;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 87ba676..c74c108 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -27,6 +27,7 @@
 #endif
 
 #include "access/commit_ts.h"
+#include "access/fdwxact.h"
 #include "access/gin.h"
 #include "access/rmgr.h"
 #include "access/transam.h"
@@ -2093,6 +2094,51 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	/*
+	 * See also CheckRequiredParameterValues() if this parameter changes
+	 */
+	{
+		{"max_prepared_foreign_transactions", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Sets the maximum number of simultaneously prepared transactions on foreign servers."),
+			NULL
+		},
+		&max_prepared_foreign_xacts,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"foreign_transaction_resolver_timeout", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Sets the maximum time to wait for foreign transaction resolution."),
+			NULL,
+			GUC_UNIT_MS
+		},
+		&foreign_xact_resolver_timeout,
+		60 * 1000, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"max_foreign_transaction_resolvers", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Maximum number of foreign transaction resolution processes."),
+			NULL
+		},
+		&max_foreign_xact_resolvers,
+		0, 0, INT_MAX,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"foreign_transaction_resolution_interval", PGC_SIGHUP, RESOURCES_ASYNCHRONOUS,
+		 gettext_noop("Sets the maximum interval between resolving foreign transactions."),
+		 NULL,
+		 GUC_UNIT_S
+		},
+		&foreign_xact_resolution_interval,
+		10, 0, INT_MAX / 1000,
+		NULL, NULL, NULL
+	},
+
 #ifdef LOCK_DEBUG
 	{
 		{"trace_lock_oidmin", PGC_SUSET, DEVELOPER_OPTIONS,
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9a35355..3caa212 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -119,6 +119,8 @@
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
+#max_prepared_foreign_transactions = 0	# zero disables the feature
+					# (change requires restart)
 # Caution: it is not advisable to set max_prepared_transactions nonzero unless
 # you actively intend to use prepared transactions.
 #work_mem = 4MB				# min 64kB
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index ad06e8e..ca3eb62 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -81,6 +81,8 @@ provider postgresql {
 	probe multixact__checkpoint__done(bool);
 	probe twophase__checkpoint__start();
 	probe twophase__checkpoint__done();
+	probe fdwxact__checkpoint__start();
+	probe fdwxact__checkpoint__done();
 
 	probe smgr__md__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
 	probe smgr__md__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, int, int);
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 2efd3b7..c28f48c 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,6 +208,7 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_fdw_xact",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index cc73b7d..5b489c0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -284,6 +284,8 @@ main(int argc, char *argv[])
 		   ControlFile->max_worker_processes);
 	printf(_("max_prepared_xacts setting:           %d\n"),
 		   ControlFile->max_prepared_xacts);
+	printf(_("max_prepared_foreign_xacts setting:   %d\n"),
+		   ControlFile->max_prepared_foreign_xacts);
 	printf(_("max_locks_per_xact setting:           %d\n"),
 		   ControlFile->max_locks_per_xact);
 	printf(_("track_commit_timestamp setting:       %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index a132cf2..1029276 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -672,6 +672,7 @@ GuessControlValues(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	ControlFile.maxAlign = MAXIMUM_ALIGNOF;
@@ -893,6 +894,7 @@ RewriteControlFile(void)
 	ControlFile.MaxConnections = 100;
 	ControlFile.max_worker_processes = 8;
 	ControlFile.max_prepared_xacts = 0;
+	ControlFile.max_prepared_foreign_xacts = 0;
 	ControlFile.max_locks_per_xact = 64;
 
 	/* Now we can force the recorded xlog seg size to the right thing. */
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..b616cea 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -11,6 +11,7 @@
 #include "access/brin_xlog.h"
 #include "access/clog.h"
 #include "access/commit_ts.h"
+#include "access/fdwxact_xlog.h"
 #include "access/generic_xlog.h"
 #include "access/ginxlog.h"
 #include "access/gistxlog.h"
diff --git a/src/include/access/fdwxact.h b/src/include/access/fdwxact.h
new file mode 100644
index 0000000..9bbe319
--- /dev/null
+++ b/src/include/access/fdwxact.h
@@ -0,0 +1,134 @@
+/*
+ * fdwxact.h
+ *
+ * PostgreSQL distributed transaction manager
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact.h
+ */
+#ifndef FDW_XACT_H
+#define FDW_XACT_H
+
+#include "access/fdwxact_xlog.h"
+#include "access/xlogreader.h"
+#include "foreign/foreign.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "storage/backendid.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/timeout.h"
+#include "utils/timestamp.h"
+
+#define	FDW_XACT_NOT_WAITING		0
+#define	FDW_XACT_WAITING			1
+#define	FDW_XACT_WAIT_COMPLETE		2
+
+#define FdwXactEnabled() (max_prepared_foreign_xacts > 0)
+
+/* Shared memory entry for a prepared or being prepared foreign transaction */
+typedef struct FdwXactData *FdwXact;
+
+/* Enum to track the status of prepared foreign transaction */
+typedef enum
+{
+	FDW_XACT_PREPARING,			/* foreign transaction is (being) prepared */
+	FDW_XACT_COMMITTING_PREPARED,		/* foreign prepared transaction is to
+										 * be committed */
+	FDW_XACT_ABORTING_PREPARED, /* foreign prepared transaction is to be
+								 * aborted */
+	FDW_XACT_RESOLVED
+} FdwXactStatus;
+
+typedef struct FdwXactData
+{
+	FdwXact		fx_free_next;	/* Next free FdwXact entry */
+	FdwXact		fx_next;		/* Next FdwXact entry associated with the same
+								   transaction */
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;	/* XID of local transaction */
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping id for connection key */
+	FdwXactStatus status;		/* The state of the foreign
+								 * transaction. This doubles as the
+								 * action to be taken on this entry. */
+
+	/*
+	 * Note that we need to keep track of two LSNs for each FdwXact. We keep
+	 * track of the start LSN because this is the address we must use to read
+	 * state data back from WAL when committing a FdwXact. We keep track of
+	 * the end LSN because that is the LSN we need to wait for prior to
+	 * commit.
+	 */
+	XLogRecPtr	fdw_xact_start_lsn;		/* XLOG offset where this entry's record starts */
+	XLogRecPtr	fdw_xact_end_lsn;		/* XLOG offset where this entry's record ends */
+
+	bool		valid; /* Has the entry been completed and written to file? */
+	BackendId	locking_backend;	/* Backend working on this entry */
+	bool		ondisk;			/* TRUE if prepare state file is on disk */
+	bool		inredo;			/* TRUE if entry was added via xlog_redo */
+	char		fdw_xact_id[FDW_XACT_ID_LEN];		/* prepared transaction identifier */
+}	FdwXactData;
+
+/* Shared memory layout for maintaining foreign prepared transaction entries. */
+typedef struct
+{
+	/* Head of linked list of free FdwXactData structs */
+	FdwXact		freeFdwXacts;
+
+	/* Number of valid foreign transaction entries */
+	int			numFdwXacts;
+
+	/* Up to max_prepared_foreign_xacts entries in the array */
+	FdwXact		fdw_xacts[FLEXIBLE_ARRAY_MEMBER];		/* Variable length array */
+}	FdwXactCtlData;
+
+/* Pointer to the shared memory holding the foreign transactions data */
+FdwXactCtlData *FdwXactCtl;
+
+/* Struct for foreign transaction resolution */
+typedef struct FdwXactResolveState
+{
+	Oid				dbid;		/* Database oid */
+	TransactionId	wait_xid;	/* Local transaction id waiting to be resolved */
+	PGPROC			*waiter;	/* Backend process waiter */
+	FdwXact			fdwxact;	/* Foreign transaction entries to resolve */
+} FdwXactResolveState;
+
+/* GUC parameters */
+extern int	max_prepared_foreign_xacts;
+extern int	max_foreign_xact_resolvers;
+extern int	foreign_xact_resolution_interval;
+extern int	foreign_xact_resolver_timeout;
+
+extern Size FdwXactShmemSize(void);
+extern void FdwXactShmemInit(void);
+extern void RecoverFdwXacts(void);
+extern void FdwXactRegisterForeignServer(Oid serverid, Oid userid, bool can_prepare,
+										 bool modify);
+extern TransactionId PrescanFdwXacts(TransactionId oldestActiveXid);
+extern bool fdw_xact_has_usermapping(Oid serverid, Oid userid);
+extern bool fdw_xact_has_server(Oid serverid);
+extern void AtEOXact_FdwXacts(bool is_commit);
+extern void AtPrepare_FdwXacts(void);
+extern bool fdw_xact_exists(TransactionId xid, Oid dboid, Oid serverid,
+				Oid userid);
+extern void CheckPointFdwXacts(XLogRecPtr redo_horizon);
+extern bool FdwTwoPhaseNeeded(void);
+extern void PreCommit_FdwXacts(void);
+extern void FdwXactRedoAdd(XLogReaderState *record);
+extern void FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid);
+extern void KnownFdwXactRecreateFiles(XLogRecPtr redo_horizon);
+extern void FdwXactWaitToBeResolved(TransactionId wait_xid, bool commit);
+extern bool FdwXactResolveDistributedTransaction(FdwXactResolveState *fstate);
+extern bool FdwXactResolveDanglingTransactions(Oid dbid);
+extern bool TwoPhaseCommitRequired(void);
+extern FdwXactResolveState *CreateFdwXactResolveState(void);
+extern void FdwXactCleanupAtProcExit(void);
+
+#endif   /* FDW_XACT_H */
diff --git a/src/include/access/fdwxact_resolver.h b/src/include/access/fdwxact_resolver.h
new file mode 100644
index 0000000..ee03cb3
--- /dev/null
+++ b/src/include/access/fdwxact_resolver.h
@@ -0,0 +1,27 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_resolver.h
+ *	  PostgreSQL foreign transaction resolver definitions
+ *
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_resolver.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FDWXACT_RESOLVER_H
+#define FDWXACT_RESOLVER_H
+
+#include "access/fdwxact.h"
+
+extern void FdwXactRslvMain(Datum main_arg);
+extern Size FdwXactRslvShmemSize(void);
+extern void FdwXactRslvShmemInit(void);
+
+extern void fdwxact_resolver_attach(int slot);
+extern void fdwxact_maybe_launch_resolver(void);
+
+extern int foreign_xact_resolver_timeout;
+
+#endif		/* FDWXACT_RESOLVER_H */
diff --git a/src/include/access/fdwxact_xlog.h b/src/include/access/fdwxact_xlog.h
new file mode 100644
index 0000000..1d50d00
--- /dev/null
+++ b/src/include/access/fdwxact_xlog.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * fdwxact_xlog.h
+ *	  Foreign transaction XLOG definitions.
+ *
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/include/access/fdwxact_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef FDWXACT_XLOG_H
+#define FDWXACT_XLOG_H
+
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* Info types for logs related to FDW transactions */
+#define XLOG_FDW_XACT_INSERT	0x00
+#define XLOG_FDW_XACT_REMOVE	0x10
+
+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)
+/*
+ * On disk file structure
+ */
+typedef struct
+{
+	Oid			dboid;			/* database oid where to find foreign server
+								 * and user mapping */
+	TransactionId local_xid;
+	Oid			serverid;		/* foreign server where transaction takes
+								 * place */
+	Oid			userid;			/* user who initiated the foreign transaction */
+	Oid			umid;			/* user mapping oid */
+	char		fdw_xact_id[FDW_XACT_ID_LEN]; /* foreign txn prepare id */
+}	FdwXactOnDiskData;
+
+typedef struct
+{
+	TransactionId xid;
+	Oid			serverid;
+	Oid			userid;
+	Oid			dbid;
+}	FdwRemoveXlogRec;
+
+extern void fdw_xact_redo(XLogReaderState *record);
+extern void fdw_xact_desc(StringInfo buf, XLogReaderState *record);
+extern const char *fdw_xact_identify(uint8 info);
+
+#endif	/* FDWXACT_XLOG_H */
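For reference, FDW_XACT_ID_LEN evaluates to 2 + 1 + 8 + 1 + 8 + 1 + 8 = 29 bytes. The sketch below is only an illustration of that arithmetic. It assumes an identifier made of a two-character prefix and three 8-digit hexadecimal fields joined by underscores, which is a guess from the macro's terms and not necessarily what the FDW's GetPrepareId callback produces. The "px" prefix and the helper name build_fdw_xact_id() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Same expression as in fdwxact_xlog.h; evaluates to 29. */
#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

/*
 * Hypothetical helper, not part of the patch.  Formats an identifier as
 * "px_<xid>_<serverid>_<userid>", each number as 8 upper-case hex digits.
 * The layout is an assumption inferred from the macro's terms.
 * Returns the number of characters the full identifier needs, excluding
 * the terminating NUL (snprintf semantics).
 */
static int
build_fdw_xact_id(char *buf, size_t buflen, unsigned int xid,
				  unsigned int serverid, unsigned int userid)
{
	return snprintf(buf, buflen, "px_%08X_%08X_%08X", xid, serverid, userid);
}
```

Under this assumed layout the identifier fills all 29 bytes, so a caller that wants a C string needs a buffer of FDW_XACT_ID_LEN + 1 bytes; the fdw_xact_id[FDW_XACT_ID_LEN] arrays in the patch leave no room for a terminator in that case.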
diff --git a/src/include/access/resolver_private.h b/src/include/access/resolver_private.h
new file mode 100644
index 0000000..75704e9
--- /dev/null
+++ b/src/include/access/resolver_private.h
@@ -0,0 +1,61 @@
+/*-------------------------------------------------------------------------
+ *
+ * resolver_private.h
+ *	  Private definitions from access/transam/fdwxact/resolver.c
+ *
+ * Portions Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * src/include/access/resolver_private.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef _RESOLVER_PRIVATE_H
+#define _RESOLVER_PRIVATE_H
+
+#include "storage/latch.h"
+#include "storage/shmem.h"
+#include "storage/spin.h"
+#include "utils/timestamp.h"
+
+/*
+ * Each foreign transaction resolver has a FdwXactResolver struct in
+ * shared memory.  This struct is protected by FdwXactResolverLock.
+ */
+typedef struct FdwXactResolver
+{
+	pid_t	pid;	/* this resolver's PID, or 0 if not active */
+	Oid		dbid;	/* database oid */
+
+	/* Indicates if this slot is used or free */
+	bool	in_use;
+
+	/* Stats */
+	TimestampTz	last_resolved_time;
+
+	/* Protect shared variables shown above */
+	slock_t	mutex;
+
+	/*
+	 * Pointer to the resolver's latch. Used by backends to wake up this
+	 * resolver when it has work to do. NULL if the resolver isn't active.
+	 */
+	Latch	*latch;
+} FdwXactResolver;
+
+/* There is one FdwXactRslvCtlData struct for the whole database cluster */
+typedef struct FdwXactRslvCtlData
+{
+	/*
+	 * Foreign transaction resolution queue. Protected by FdwXactLock.
+	 */
+	SHM_QUEUE	FdwXactQueue;
+
+	FdwXactResolver resolvers[FLEXIBLE_ARRAY_MEMBER];
+} FdwXactRslvCtlData;
+
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+extern FdwXactResolver *MyFdwXactResolver;
+extern FdwXactRslvCtlData *FdwXactRslvCtl;
+
+#endif	/* _RESOLVER_PRIVATE_H */
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 0bbe987..c15dff7 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_FDW_XACT_ID, "Foreign Transactions", fdw_xact_redo, fdw_xact_desc, fdw_xact_identify, NULL, NULL, NULL)
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 34d9470..2864d98 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -35,6 +35,7 @@ extern void PostPrepare_Twophase(void);
 
 extern PGPROC *TwoPhaseGetDummyProc(TransactionId xid);
 extern BackendId TwoPhaseGetDummyBackendId(TransactionId xid);
+extern bool	TwoPhaseExists(TransactionId xid);
 
 extern GlobalTransaction MarkAsPreparing(TransactionId xid, const char *gid,
 				TimestampTz prepared_at,
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index a5c0746..bd3aa9a 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -227,6 +227,7 @@ typedef struct xl_parameter_change
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	int			wal_level;
 	bool		wal_log_hints;
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 773d9e6..3d5333a 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -178,6 +178,7 @@ typedef struct ControlFileData
 	int			MaxConnections;
 	int			max_worker_processes;
 	int			max_prepared_xacts;
+	int			max_prepared_foreign_xacts;
 	int			max_locks_per_xact;
 	bool		track_commit_timestamp;
 
diff --git a/src/include/catalog/pg_proc.h b/src/include/catalog/pg_proc.h
index 2a53213..25b7354 100644
--- a/src/include/catalog/pg_proc.h
+++ b/src/include/catalog/pg_proc.h
@@ -2903,6 +2903,8 @@ DATA(insert OID = 3317 (  pg_stat_get_wal_receiver	PGNSP PGUID 12 1 0 0 0 f f f
 DESCR("statistics: information about WAL receiver");
 DATA(insert OID = 6118 (  pg_stat_get_subscription	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 1 0 2249 "26" "{26,26,26,23,3220,1184,1184,3220,1184}" "{i,o,o,o,o,o,o,o,o}" "{subid,subid,relid,pid,received_lsn,last_msg_send_time,last_msg_receipt_time,latest_end_lsn,latest_end_time}" _null_ _null_ pg_stat_get_subscription _null_ _null_ _null_ ));
 DESCR("statistics: information about subscription");
+DATA(insert OID = 4101 (  pg_stat_get_fdwxact_resolver	PGNSP PGUID 12 1 0 0 0 f f f f f f s r 0 0 2249 "" "{26,26,26,1184}" "{o,o,o,o}" "{pid,dbid,n_entries,last_resolved_time}" _null_ _null_ pg_stat_get_fdwxact_resolver _null_ _null_ _null_ ));
+DESCR("statistics: information about foreign transaction resolver");
 DATA(insert OID = 2026 (  pg_backend_pid				PGNSP PGUID 12 1 0 0 0 f f f f t f s r 0 0 23 "" _null_ _null_ _null_ _null_ _null_ pg_backend_pid _null_ _null_ _null_ ));
 DESCR("statistics: current backend PID");
 DATA(insert OID = 1937 (  pg_stat_get_backend_pid		PGNSP PGUID 12 1 0 0 0 f f f f t f s r 1 0 23 "23" _null_ _null_ _null_ _null_ _null_ pg_stat_get_backend_pid _null_ _null_ _null_ ));
@@ -4232,6 +4234,15 @@ DESCR("get the available time zone names");
 DATA(insert OID = 2730 (  pg_get_triggerdef		PGNSP PGUID 12 1 0 0 0 f f f f t f s s 2 0 25 "26 16" _null_ _null_ _null_ _null_ _null_ pg_get_triggerdef_ext _null_ _null_ _null_ ));
 DESCR("trigger description with pretty-print option");
 
+/* foreign transactions */
+DATA(insert OID = 4099 ( pg_prepared_fdw_xacts	PGNSP PGUID 12 1 1000 0 0 f f f f t t v u 0 0 2249 "" "{26,28,26,26,25,25}" "{o,o,o,o,o,o}" "{dbid,transaction,serverid,userid,status,identifier}" _null_ _null_ pg_prepared_fdw_xacts _null_ _null_ _null_ ));
+DESCR("view foreign transactions");
+DATA(insert OID = 4100 ( pg_remove_fdw_xact PGNSP PGUID 12 1 0 0 0 f f f f f f v u 3 0 2278 "28 26 26" _null_ _null_ _null_ _null_ _null_ pg_remove_fdw_xact _null_ _null_ _null_ ));
+DESCR("remove foreign transactions");
+DATA(insert OID = 4139 ( pg_resolve_fdw_xact	PGNSP PGUID 12 1 0 0 0 f f f f t f v u 3 0 16 "28 26 26" _null_ _null_ _null_ _null_ _null_ pg_resolve_fdw_xact _null_ _null_ _null_ ));
+DESCR("resolve foreign transaction");
+
+
 /* asynchronous notifications */
 DATA(insert OID = 3035 (  pg_listening_channels PGNSP PGUID 12 1 10 0 0 f f f f t t s r 0 0 25 "" _null_ _null_ _null_ _null_ _null_ pg_listening_channels _null_ _null_ _null_ ));
 DESCR("get the channels that the current backend listens to");
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e88fee3..a17a114 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -162,6 +162,18 @@ typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
 
+typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid,
+										int *prep_info_len);
+typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
+													Oid umid, const char *prep_id);
+typedef bool (*EndForeignTransaction_function) (Oid serverid, Oid userid,
+												Oid umid, bool is_commit);
+typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
+															Oid userid,
+															Oid umid,
+															bool is_commit,
+															const char *prep_id);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -226,6 +238,12 @@ typedef struct FdwRoutine
 	/* Support functions for IMPORT FOREIGN SCHEMA */
 	ImportForeignSchema_function ImportForeignSchema;
 
+	/* Support functions for distributed transactions */
+	GetPrepareId_function GetPrepareId;
+	EndForeignTransaction_function EndForeignTransaction;
+	PrepareForeignTransaction_function PrepareForeignTransaction;
+	ResolvePreparedForeignTransaction_function ResolvePreparedForeignTransaction;
+
 	/* Support functions for parallelism under Gather node */
 	IsForeignScanParallelSafe_function IsForeignScanParallelSafe;
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
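To make the intended use of the new callbacks easier to follow, here is a minimal standalone sketch of the prepare-then-resolve flow. It is not the logic in fdwxact.c: Oid is a stand-in typedef, the Participant struct, two_phase_end() and the demo_* stubs are all hypothetical, and servers that were never prepared (which would presumably go through EndForeignTransaction) are simply skipped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's Oid so the sketch compiles on its own. */
typedef unsigned int Oid;

/* Callback shapes copied from the fdwapi.h hunk. */
typedef bool (*PrepareForeignTransaction_function) (Oid serverid, Oid userid,
													Oid umid, const char *prep_id);
typedef bool (*ResolvePreparedForeignTransaction_function) (Oid serverid,
															Oid userid,
															Oid umid,
															bool is_commit,
															const char *prep_id);

/* Hypothetical participant record; the patch tracks this in FdwXactData. */
typedef struct Participant
{
	Oid			serverid;
	Oid			userid;
	Oid			umid;
	const char *prep_id;
	bool		prepared;
} Participant;

/*
 * Hypothetical driver.  Phase 1 prepares every participant and stops at
 * the first failure.  Phase 2 commits the prepared participants if all of
 * them prepared, and rolls them back otherwise.  Returns true on commit.
 */
static bool
two_phase_end(Participant *parts, size_t n,
			  PrepareForeignTransaction_function prepare,
			  ResolvePreparedForeignTransaction_function resolve)
{
	bool		all_prepared = true;
	size_t		i;

	for (i = 0; i < n; i++)
		parts[i].prepared = false;

	for (i = 0; i < n && all_prepared; i++)
	{
		parts[i].prepared = prepare(parts[i].serverid, parts[i].userid,
									parts[i].umid, parts[i].prep_id);
		if (!parts[i].prepared)
			all_prepared = false;
	}

	for (i = 0; i < n; i++)
	{
		if (parts[i].prepared)
			(void) resolve(parts[i].serverid, parts[i].userid,
						   parts[i].umid, all_prepared, parts[i].prep_id);
	}

	return all_prepared;
}

/* Demo stubs (hypothetical): prepare fails for serverid 0. */
static int	demo_commits = 0;
static int	demo_rollbacks = 0;

static bool
demo_prepare(Oid serverid, Oid userid, Oid umid, const char *prep_id)
{
	(void) userid;
	(void) umid;
	(void) prep_id;
	return serverid != 0;
}

static bool
demo_resolve(Oid serverid, Oid userid, Oid umid, bool is_commit,
			 const char *prep_id)
{
	(void) serverid;
	(void) userid;
	(void) umid;
	(void) prep_id;
	if (is_commit)
		demo_commits++;
	else
		demo_rollbacks++;
	return true;
}
```

The point of the sketch is only the ordering guarantee: no participant is committed until every participant has been prepared, which is what the atomicity argument in this thread relies on. A failed resolve call is ignored here; in the patch that case is what the resolver processes exist to retry.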
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..10f0391 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,9 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_FDW_XACT_RESOLUTION,
+	WAIT_EVENT_FDW_XACT_RESOLVER_MAIN
 } WaitEventIPC;
 
 /* ----------
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 5c19a61..93953dc 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -150,6 +150,16 @@ struct PGPROC
 	SHM_QUEUE	syncRepLinks;	/* list link if process is in syncrep queue */
 
 	/*
+	 * Info to allow us to wait for foreign transaction to be resolved, if
+	 * needed.
+	 */
+	TransactionId	fdwXactWaitXid;	/* waiting for foreign transaction involved with
+									 * this transaction id to be resolved */
+	int			fdwXactState;	/* wait state for foreign transaction
+								 * resolution */
+	SHM_QUEUE	fdwXactLinks;	/* list link if process is in queue */
+
+	/*
 	 * All PROCLOCK objects for locks held or awaited by this backend are
 	 * linked into one of these lists, according to the partition number of
 	 * their lock.
diff --git a/src/test/recovery/Makefile b/src/test/recovery/Makefile
index daf79a0..71c8b9d 100644
--- a/src/test/recovery/Makefile
+++ b/src/test/recovery/Makefile
@@ -9,7 +9,7 @@
 #
 #-------------------------------------------------------------------------
 
-EXTRA_INSTALL=contrib/test_decoding
+EXTRA_INSTALL=contrib/test_decoding contrib/postgres_fdw
 
 subdir = src/test/recovery
 top_builddir = ../../..
diff --git a/src/test/recovery/t/015_fdwxact.pl b/src/test/recovery/t/015_fdwxact.pl
new file mode 100644
index 0000000..28c9de9
--- /dev/null
+++ b/src/test/recovery/t/015_fdwxact.pl
@@ -0,0 +1,176 @@
+# Tests for transactions involving foreign servers
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use Test::More tests => 7;
+
+# Setup master node
+my $node_master = get_new_node("master");
+my $node_standby = get_new_node("standby");
+
+$node_master->init(allows_streaming => 1);
+$node_master->append_conf('postgresql.conf', qq(
+max_prepared_transactions = 10
+max_prepared_foreign_transactions = 10
+max_foreign_transaction_resolvers = 2
+foreign_transaction_resolver_timeout = 0
+foreign_transaction_resolution_interval = 5s
+));
+$node_master->start;
+
+# Take backup from master node
+my $backup_name = 'master_backup';
+$node_master->backup($backup_name);
+
+# Set up standby node
+$node_standby->init_from_backup($node_master, $backup_name,
+							   has_streaming => 1);
+$node_standby->start;
+
+# Set up foreign nodes
+my $node_fs1 = get_new_node("fs1");
+my $node_fs2 = get_new_node("fs2");
+my $fs1_port = $node_fs1->port;
+my $fs2_port = $node_fs2->port;
+$node_fs1->init;
+$node_fs2->init;
+$node_fs1->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs2->append_conf('postgresql.conf', qq(max_prepared_transactions = 10));
+$node_fs1->start;
+$node_fs2->start;
+
+# Create foreign servers on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE EXTENSION postgres_fdw
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs1 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs1_port', two_phase_commit 'on');
+));
+$node_master->safe_psql('postgres', qq(
+CREATE SERVER fs2 FOREIGN DATA WRAPPER postgres_fdw
+OPTIONS (dbname 'postgres', port '$fs2_port', two_phase_commit 'on');
+));
+
+# Create user mapping on the master node
+$node_master->safe_psql('postgres', qq(
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs1;
+CREATE USER MAPPING FOR CURRENT_USER SERVER fs2;
+));
+
+# Create tables on foreign nodes and import them to the master node
+$node_fs1->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t1 (c int);
+));
+$node_fs2->safe_psql('postgres', qq(
+CREATE SCHEMA fs;
+CREATE TABLE fs.t2 (c int);
+));
+$node_master->safe_psql('postgres', qq(
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs1 INTO public;
+IMPORT FOREIGN SCHEMA fs FROM SERVER fs2 INTO public;
+CREATE TABLE l_table (c int);
+));
+
+# Switch to synchronous replication
+$node_master->safe_psql('postgres', qq(
+ALTER SYSTEM SET synchronous_standby_names ='*';
+));
+$node_master->reload;
+
+my $result;
+
+#
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node. Check if we can commit and rollback the foreign transactions
+# after the normal recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->stop;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after recovery');
+
+#
+# Prepare two transactions involving multiple foreign servers and shutdown
+# the master node immediately. Check if we can commit and rollback the foreign
+# transactions after the crash recovery.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (1);
+INSERT INTO t2 VALUES (1);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (2);
+INSERT INTO t2 VALUES (2);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+# Commit and rollback foreign transactions after the crash recovery.
+$result = $node_master->psql('postgres', qq(COMMIT PREPARED 'gxid1'));
+is($result, 0, 'Commit foreign transactions after crash recovery');
+$result = $node_master->psql('postgres', qq(ROLLBACK PREPARED 'gxid2'));
+is($result, 0, 'Rollback foreign transactions after crash recovery');
+
+#
+# Commit transaction involving foreign servers and shutdown the master node
+# immediately before checkpoint. Check that WAL replay cleans up its
+# shared memory state and releases locks while replaying transaction commit.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (3);
+INSERT INTO t2 VALUES (3);
+COMMIT;
+));
+
+$node_master->teardown_node;
+$node_master->start;
+
+$result = $node_master->safe_psql('postgres', qq(
+SELECT count(*) FROM pg_prepared_fdw_xacts;
+));
+is($result, 0, "Cleanup of shared memory state for foreign transactions");
+
+#
+# Check if the standby node can process prepared foreign transactions
+# after promotion.
+#
+$node_master->safe_psql('postgres', qq(
+BEGIN;
+INSERT INTO t1 VALUES (4);
+INSERT INTO t2 VALUES (4);
+PREPARE TRANSACTION 'gxid1';
+BEGIN;
+INSERT INTO t1 VALUES (5);
+INSERT INTO t2 VALUES (5);
+PREPARE TRANSACTION 'gxid2';
+));
+
+$node_master->teardown_node;
+$node_standby->promote;
+
+$result = $node_standby->psql('postgres', qq(COMMIT PREPARED 'gxid1';));
+is($result, 0, 'Commit foreign transaction after promotion');
+$result = $node_standby->psql('postgres', qq(ROLLBACK PREPARED 'gxid2';));
+is($result, 0, 'Rollback foreign transaction after promotion');
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 5433944..ca155ec 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1413,6 +1413,13 @@ pg_policies| SELECT n.nspname AS schemaname,
    FROM ((pg_policy pol
      JOIN pg_class c ON ((c.oid = pol.polrelid)))
      LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));
+pg_prepared_fdw_xacts| SELECT f.dbid,
+    f.transaction,
+    f.serverid,
+    f.userid,
+    f.status,
+    f.identifier
+   FROM pg_prepared_fdw_xacts() f(dbid, transaction, serverid, userid, status, identifier);
 pg_prepared_statements| SELECT p.name,
     p.statement,
     p.prepare_time,
@@ -1819,6 +1826,11 @@ pg_stat_database_conflicts| SELECT d.oid AS datid,
     pg_stat_get_db_conflict_bufferpin(d.oid) AS confl_bufferpin,
     pg_stat_get_db_conflict_startup_deadlock(d.oid) AS confl_deadlock
    FROM pg_database d;
+pg_stat_fdwxact_resolvers| SELECT r.pid,
+    r.dbid,
+    r.last_resolved_time
+   FROM pg_stat_get_fdwxact_resolver() r(pid, dbid, n_entries, last_resolved_time)
+  WHERE (r.pid IS NOT NULL);
 pg_stat_progress_vacuum| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index a1ee104..35d66f4 100644
--- a/src/test/regress/pg_regress.c
+++ b/src/test/regress/pg_regress.c
@@ -2292,9 +2292,12 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		 * Adjust the default postgresql.conf for regression testing. The user
 		 * can specify a file to be appended; in any case we expand logging
 		 * and set max_prepared_transactions to enable testing of prepared
-		 * xacts.  (Note: to reduce the probability of unexpected shmmax
-		 * failures, don't set max_prepared_transactions any higher than
-		 * actually needed by the prepared_xacts regression test.)
+		 * xacts.  We also set max_prepared_foreign_transactions and
+		 * max_foreign_transaction_resolvers to enable testing of transactions
+		 * involving multiple foreign servers.  (Note: to reduce the probability
+		 * of unexpected shmmax failures, don't set max_prepared_transactions
+		 * any higher than actually needed by the prepared_xacts regression
+		 * test.)
 		 */
 		snprintf(buf, sizeof(buf), "%s/data/postgresql.conf", temp_instance);
 		pg_conf = fopen(buf, "a");
@@ -2309,7 +2312,9 @@ regression_main(int argc, char *argv[], init_function ifunc, test_function tfunc
 		fputs("log_line_prefix = '%m [%p] %q%a '\n", pg_conf);
 		fputs("log_lock_waits = on\n", pg_conf);
 		fputs("log_temp_files = 128kB\n", pg_conf);
-		fputs("max_prepared_transactions = 2\n", pg_conf);
+		fputs("max_prepared_transactions = 3\n", pg_conf);
+		fputs("max_prepared_foreign_transactions = 2\n", pg_conf);
+		fputs("max_foreign_transaction_resolvers = 2\n", pg_conf);
 
 		for (sl = temp_configs; sl != NULL; sl = sl->next)
 		{
-- 
1.7.1

Attachment: 0003-postgres_fdw-supports-atomic-distributed-transaction_v15.patch (application/octet-stream)

From 2b210cdb93e352c0599e8b257c0683fd58b5ddf7 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.mshk@gmail.com>
Date: Tue, 26 Dec 2017 19:52:05 +0900
Subject: [PATCH 3/3] postgres_fdw supports atomic distributed transaction commit.

---
 contrib/postgres_fdw/connection.c              |  566 +++++++++++++++---------
 contrib/postgres_fdw/expected/postgres_fdw.out |  343 ++++++++++++++-
 contrib/postgres_fdw/option.c                  |    5 +-
 contrib/postgres_fdw/postgres_fdw.c            |   93 ++++-
 contrib/postgres_fdw/postgres_fdw.h            |   14 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  133 ++++++
 doc/src/sgml/postgres-fdw.sgml                 |   37 ++
 7 files changed, 952 insertions(+), 239 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 00c926b..775e0c0 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -14,9 +14,11 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdwxact.h"
 #include "access/htup_details.h"
 #include "catalog/pg_user_mapping.h"
 #include "access/xact.h"
+#include "commands/defrem.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -73,13 +75,13 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
-static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
+static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user,
+								 bool connection_error_ok);
 static void disconnect_pg_server(ConnCacheEntry *entry);
 static void check_conn_params(const char **keywords, const char **values, UserMapping *user);
 static void configure_remote_session(PGconn *conn);
 static void do_sql_command(PGconn *conn, const char *sql);
-static void begin_remote_xact(ConnCacheEntry *entry);
-static void pgfdw_xact_callback(XactEvent event, void *arg);
+static void begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid);
 static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId mySubid,
 					   SubTransactionId parentSubid,
@@ -91,24 +93,27 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
 						 bool ignore_errors);
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 						 PGresult **result);
+static void pgfdw_cleanup_after_transaction(ConnCacheEntry *entry, bool is_commit);
+static ConnCacheEntry *GetConnectionCacheEntry(Oid umid);
 
-
-/*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- */
 PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+GetExistingConnection(Oid umid)
 {
-	bool		found;
-	ConnCacheEntry *entry;
-	ConnCacheKey key;
+	ConnCacheEntry	*entry;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	Assert(entry->conn != NULL);
+
+	return entry->conn;
+}
+
+static ConnCacheEntry *
+GetConnectionCacheEntry(Oid umid)
+{
+	ConnCacheEntry	*entry;
+	ConnCacheKey	key;
+	bool			found;
 
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
@@ -128,7 +133,6 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		 * Register some callback functions that manage connection cleanup.
 		 * This should be done just once in each backend.
 		 */
-		RegisterXactCallback(pgfdw_xact_callback, NULL);
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 		CacheRegisterSyscacheCallback(FOREIGNSERVEROID,
 									  pgfdw_inval_callback, (Datum) 0);
@@ -136,11 +140,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 									  pgfdw_inval_callback, (Datum) 0);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -155,6 +156,28 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->conn = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt,
+			  bool start_transaction, bool connection_error_ok)
+{
+	ConnCacheEntry *entry;
+
+	/* Get connection cache entry from cache */
+	entry = GetConnectionCacheEntry(user->umid);
+
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
@@ -198,7 +221,16 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 								  ObjectIdGetDatum(user->umid));
 
 		/* Now try to make the connection */
-		entry->conn = connect_pg_server(server, user);
+		entry->conn = connect_pg_server(server, user, connection_error_ok);
+
+		Assert(entry->conn || connection_error_ok);
+
+		if (!entry->conn && connection_error_ok)
+		{
+			elog(DEBUG3, "attempt to connect to server \"%s\" by postgres_fdw failed",
+				 server->servername);
+			return NULL;
+		}
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
@@ -207,7 +239,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/*
 	 * Start a new transaction or subtransaction if needed.
 	 */
-	begin_remote_xact(entry);
+	if (start_transaction)
+	{
+		begin_remote_xact(entry, user->serverid, user->userid);
+		/* Set flag that we did GetConnection during the current transaction */
+		xact_got_connection = true;
+	}
 
 	/* Remember if caller will prepare statements */
 	entry->have_prep_stmt |= will_prep_stmt;
@@ -217,9 +254,12 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 /*
  * Connect to remote server using specified server and user mapping properties.
+ * If the attempt to connect fails and the caller can handle connection failure
+ * (connection_error_ok = true), return NULL; otherwise throw an error.
  */
 static PGconn *
-connect_pg_server(ForeignServer *server, UserMapping *user)
+connect_pg_server(ForeignServer *server, UserMapping *user,
+				  bool connection_error_ok)
 {
 	PGconn	   *volatile conn = NULL;
 
@@ -265,11 +305,25 @@ connect_pg_server(ForeignServer *server, UserMapping *user)
 
 		conn = PQconnectdbParams(keywords, values, false);
 		if (!conn || PQstatus(conn) != CONNECTION_OK)
-			ereport(ERROR,
-					(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
-					 errmsg("could not connect to server \"%s\"",
-							server->servername),
-					 errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		{
+			char	   *connmessage;
+			int			msglen;
+
+			/* libpq typically appends a newline, strip that */
+			connmessage = pstrdup(PQerrorMessage(conn));
+			msglen = strlen(connmessage);
+			if (msglen > 0 && connmessage[msglen - 1] == '\n')
+				connmessage[msglen - 1] = '\0';
+
+			if (connection_error_ok)
+				return NULL;
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_SQLCLIENT_UNABLE_TO_ESTABLISH_SQLCONNECTION),
+						 errmsg("could not connect to server \"%s\"",
+								server->servername),
+						 errdetail_internal("%s", pchomp(PQerrorMessage(conn)))));
+		}
 
 		/*
 		 * Check that non-superuser has used password to establish connection;
@@ -414,15 +468,24 @@ do_sql_command(PGconn *conn, const char *sql)
  * control which remote queries share a snapshot.
  */
 static void
-begin_remote_xact(ConnCacheEntry *entry)
+begin_remote_xact(ConnCacheEntry *entry, Oid serverid, Oid userid)
 {
 	int			curlevel = GetCurrentTransactionNestLevel();
+	ForeignServer	*server = GetForeignServer(serverid);
 
 	/* Start main transaction if we haven't yet */
 	if (entry->xact_depth <= 0)
 	{
 		const char *sql;
 
+		/*
+		 * Register the new foreign server and check whether two-phase commit
+		 * is possible for it.
+		 */
+		FdwXactRegisterForeignServer(serverid, userid,
+									 server_uses_two_phase_commit(server),
+									 false);
+
 		elog(DEBUG3, "starting remote transaction on connection %p",
 			 entry->conn);
 
@@ -644,193 +707,6 @@ pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 }
 
 /*
- * pgfdw_xact_callback --- cleanup at main-transaction end.
- */
-static void
-pgfdw_xact_callback(XactEvent event, void *arg)
-{
-	HASH_SEQ_STATUS scan;
-	ConnCacheEntry *entry;
-
-	/* Quick exit if no connections were touched in this transaction. */
-	if (!xact_got_connection)
-		return;
-
-	/*
-	 * Scan all connection cache entries to find open remote transactions, and
-	 * close them.
-	 */
-	hash_seq_init(&scan, ConnectionHash);
-	while ((entry = (ConnCacheEntry *) hash_seq_search(&scan)))
-	{
-		PGresult   *res;
-
-		/* Ignore cache entry if no open connection right now */
-		if (entry->conn == NULL)
-			continue;
-
-		/* If it has an open remote transaction, try to close it */
-		if (entry->xact_depth > 0)
-		{
-			bool		abort_cleanup_failure = false;
-
-			elog(DEBUG3, "closing remote transaction on connection %p",
-				 entry->conn);
-
-			switch (event)
-			{
-				case XACT_EVENT_PARALLEL_PRE_COMMIT:
-				case XACT_EVENT_PRE_COMMIT:
-
-					/*
-					 * If abort cleanup previously failed for this connection,
-					 * we can't issue any more commands against it.
-					 */
-					pgfdw_reject_incomplete_xact_state_change(entry);
-
-					/* Commit all remote transactions during pre-commit */
-					entry->changing_xact_state = true;
-					do_sql_command(entry->conn, "COMMIT TRANSACTION");
-					entry->changing_xact_state = false;
-
-					/*
-					 * If there were any errors in subtransactions, and we
-					 * made prepared statements, do a DEALLOCATE ALL to make
-					 * sure we get rid of all prepared statements. This is
-					 * annoying and not terribly bulletproof, but it's
-					 * probably not worth trying harder.
-					 *
-					 * DEALLOCATE ALL only exists in 8.3 and later, so this
-					 * constrains how old a server postgres_fdw can
-					 * communicate with.  We intentionally ignore errors in
-					 * the DEALLOCATE, so that we can hobble along to some
-					 * extent with older servers (leaking prepared statements
-					 * as we go; but we don't really support update operations
-					 * pre-8.3 anyway).
-					 */
-					if (entry->have_prep_stmt && entry->have_error)
-					{
-						res = PQexec(entry->conn, "DEALLOCATE ALL");
-						PQclear(res);
-					}
-					entry->have_prep_stmt = false;
-					entry->have_error = false;
-					break;
-				case XACT_EVENT_PRE_PREPARE:
-
-					/*
-					 * We disallow remote transactions that modified anything,
-					 * since it's not very reasonable to hold them open until
-					 * the prepared transaction is committed.  For the moment,
-					 * throw error unconditionally; later we might allow
-					 * read-only cases.  Note that the error will cause us to
-					 * come right back here with event == XACT_EVENT_ABORT, so
-					 * we'll clean up the connection state at that point.
-					 */
-					ereport(ERROR,
-							(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
-							 errmsg("cannot prepare a transaction that modified remote tables")));
-					break;
-				case XACT_EVENT_PARALLEL_COMMIT:
-				case XACT_EVENT_COMMIT:
-				case XACT_EVENT_PREPARE:
-					/* Pre-commit should have closed the open transaction */
-					elog(ERROR, "missed cleaning up connection during pre-commit");
-					break;
-				case XACT_EVENT_PARALLEL_ABORT:
-				case XACT_EVENT_ABORT:
-
-					/*
-					 * Don't try to clean up the connection if we're already
-					 * in error recursion trouble.
-					 */
-					if (in_error_recursion_trouble())
-						entry->changing_xact_state = true;
-
-					/*
-					 * If connection is already unsalvageable, don't touch it
-					 * further.
-					 */
-					if (entry->changing_xact_state)
-						break;
-
-					/*
-					 * Mark this connection as in the process of changing
-					 * transaction state.
-					 */
-					entry->changing_xact_state = true;
-
-					/* Assume we might have lost track of prepared statements */
-					entry->have_error = true;
-
-					/*
-					 * If a command has been submitted to the remote server by
-					 * using an asynchronous execution function, the command
-					 * might not have yet completed.  Check to see if a
-					 * command is still being processed by the remote server,
-					 * and if so, request cancellation of the command.
-					 */
-					if (PQtransactionStatus(entry->conn) == PQTRANS_ACTIVE &&
-						!pgfdw_cancel_query(entry->conn))
-					{
-						/* Unable to cancel running query. */
-						abort_cleanup_failure = true;
-					}
-					else if (!pgfdw_exec_cleanup_query(entry->conn,
-													   "ABORT TRANSACTION",
-													   false))
-					{
-						/* Unable to abort remote transaction. */
-						abort_cleanup_failure = true;
-					}
-					else if (entry->have_prep_stmt && entry->have_error &&
-							 !pgfdw_exec_cleanup_query(entry->conn,
-													   "DEALLOCATE ALL",
-													   true))
-					{
-						/* Trouble clearing prepared statements. */
-						abort_cleanup_failure = true;
-					}
-					else
-					{
-						entry->have_prep_stmt = false;
-						entry->have_error = false;
-					}
-
-					/* Disarm changing_xact_state if it all worked. */
-					entry->changing_xact_state = abort_cleanup_failure;
-					break;
-			}
-		}
-
-		/* Reset state to show we're out of a transaction */
-		entry->xact_depth = 0;
-
-		/*
-		 * If the connection isn't in a good idle state, discard it to
-		 * recover. Next GetConnection will open a new connection.
-		 */
-		if (PQstatus(entry->conn) != CONNECTION_OK ||
-			PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
-			entry->changing_xact_state)
-		{
-			elog(DEBUG3, "discarding connection %p", entry->conn);
-			disconnect_pg_server(entry);
-		}
-	}
-
-	/*
-	 * Regardless of the event type, we can now mark ourselves as out of the
-	 * transaction.  (Note: if we are here during PRE_COMMIT or PRE_PREPARE,
-	 * this saves a useless scan of the hashtable during COMMIT or PREPARE.)
-	 */
-	xact_got_connection = false;
-
-	/* Also reset cursor numbering for next transaction */
-	cursor_number = 0;
-}
-
-/*
  * pgfdw_subxact_callback --- cleanup at subtransaction end.
  */
 static void
@@ -1193,3 +1069,255 @@ exit:	;
 		*result = last_res;
 	return timed_out;
 }
+
+/*
+ * Prepare the transaction on the foreign server. This function is called
+ * only at the pre-commit phase of the local transaction. Since a connection
+ * to the server in question should already exist, we look it up by umid (the
+ * key of the connection cache) and do not need serverid and userid to fetch
+ * the user mapping.
+ */
+bool
+postgresPrepareForeignTransaction(Oid serverid, Oid userid, Oid umid,
+								  const char *prep_id)
+{
+	ConnCacheEntry *entry = NULL;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	pgfdw_reject_incomplete_xact_state_change(entry);
+
+	if (entry->conn)
+	{
+		StringInfo	command;
+		bool		result;
+
+		pgfdw_reject_incomplete_xact_state_change(entry);
+
+		command = makeStringInfo();
+		appendStringInfo(command, "PREPARE TRANSACTION '%s'", prep_id);
+
+		entry->changing_xact_state = true;
+		result = pgfdw_exec_cleanup_query(entry->conn, command->data, false);
+		entry->changing_xact_state = false;
+
+		elog(DEBUG1, "prepare foreign transaction on server %u with ID %s",
+			 serverid, prep_id);
+
+		pgfdw_cleanup_after_transaction(entry, true);
+		return result;
+	}
+
+	return false;
+}
+
+/*
+ * Commit or abort the transaction on the foreign server. This function is
+ * called both at the pre-commit phase of the local transaction when
+ * committing and at the end of the local transaction when aborting. Since
+ * connections to the servers involved in the local transaction should
+ * already exist, we look the connection up by umid (the key of the connection
+ * cache) and do not need serverid and userid to fetch the user mapping.
+ */
+bool
+postgresEndForeignTransaction(Oid serverid, Oid userid, Oid umid,
+							  bool is_commit)
+{
+	ConnCacheEntry *entry = NULL;
+
+	entry = GetConnectionCacheEntry(umid);
+
+	/*
+	 * If abort cleanup previously failed for this connection, we can't issue
+	 * any more commands against it.
+	 */
+	if (is_commit)
+		pgfdw_reject_incomplete_xact_state_change(entry);
+
+	if (entry->conn)
+	{
+		StringInfo	command;
+		bool	result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s TRANSACTION",	is_commit ? "COMMIT" : "ROLLBACK");
+		entry->changing_xact_state = true;
+		result = pgfdw_exec_cleanup_query(entry->conn, command->data, false);
+		entry->changing_xact_state = false;
+
+		pgfdw_cleanup_after_transaction(entry, true);
+		return result;
+	}
+
+	return false;
+}
+
+/*
+ * Commit or abort a prepared transaction on the foreign server. This function
+ * can be called both at the end of the local transaction and in a new
+ * transaction, for example, by the resolver process.
+ */
+bool
+postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid, Oid umid,
+										  bool is_commit, const char *prep_id)
+{
+	ConnCacheEntry *entry;
+	PGconn			*conn;
+
+	/*
+	 * If we are in a valid transaction state, we are resolving the foreign
+	 * transaction in a newly started transaction and might not have a
+	 * connection to the server yet, so we use GetConnection, which establishes
+	 * the connection if we don't have one. This can happen when the foreign
+	 * transaction resolver process tries to resolve it. On the other hand, if
+	 * we are not in a valid transaction state, we are resolving the foreign
+	 * transaction at the end of the local transaction. Since we should already
+	 * have a connection to the server, we just get the connection cache entry.
+	 */
+	if (IsTransactionState())
+		conn = GetConnection(GetUserMapping(userid, serverid), false, false, false);
+	else
+	{
+		entry = GetConnectionCacheEntry(umid);
+
+		/* Reject further use of connections which failed abort cleanup */
+		if (is_commit)
+			pgfdw_reject_incomplete_xact_state_change(entry);
+
+		conn = entry->conn;
+	}
+
+	if (conn)
+	{
+		StringInfo		command;
+		PGresult		*res;
+		bool			result;
+
+		command = makeStringInfo();
+		appendStringInfo(command, "%s PREPARED '%s'",
+						 is_commit ? "COMMIT" : "ROLLBACK",
+						 prep_id);
+		res = PQexec(conn, command->data);
+
+		if (PQresultStatus(res) != PGRES_COMMAND_OK)
+		{
+			int		sqlstate;
+			char	*diag_sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);
+
+			/*
+			 * The command failed; raise a warning to log the reason for the failure.
+			 * We may not be in a transaction here, so raising error doesn't
+			 * help. Even if we are in a transaction, it would be the resolver
+			 * transaction, which will get aborted on raising error, thus
+			 * delaying resolution of other prepared foreign transactions.
+			 */
+			pgfdw_report_error(WARNING, res, conn, false, command->data);
+
+			if (diag_sqlstate)
+			{
+				sqlstate = MAKE_SQLSTATE(diag_sqlstate[0],
+										 diag_sqlstate[1],
+										 diag_sqlstate[2],
+										 diag_sqlstate[3],
+										 diag_sqlstate[4]);
+			}
+			else
+				sqlstate = ERRCODE_CONNECTION_FAILURE;
+
+			/*
+			 * If we tried to COMMIT/ABORT a prepared transaction and the prepared
+			 * transaction was missing on the foreign server, it was probably
+			 * resolved by some other means. Anyway, it should be considered as resolved.
+			 */
+			result = (sqlstate == ERRCODE_UNDEFINED_OBJECT);
+		}
+		else
+			result = true;
+
+		elog(DEBUG1, "%s prepared foreign transaction on server %u with ID %s",
+			 is_commit ? "committed" : "aborted", serverid, prep_id);
+
+		PQclear(res);
+		ReleaseConnection(conn);
+		return result;
+	}
+
+	return false;
+}
+
+/* Cleanup at main-transaction end */
+static void
+pgfdw_cleanup_after_transaction(ConnCacheEntry *entry, bool is_commit)
+{
+	if (entry->xact_depth > 0)
+	{
+		if (is_commit)
+		{
+			/*
+			 * If there were any errors in subtransactions, and we made prepared
+			 * statements, do a DEALLOCATE ALL to make sure we get rid of all
+			 * prepared statements. This is annoying and not terribly bulletproof,
+			 * but it's probably not worth trying harder.
+			 *
+			 * DEALLOCATE ALL only exists in 8.3 and later, so this constrains how
+			 * old a server postgres_fdw can communicate with.	We intentionally
+			 * ignore errors in the DEALLOCATE, so that we can hobble along to some
+			 * extent with older servers (leaking prepared statements as we go;
+			 * but we don't really support update operations pre-8.3 anyway).
+			 */
+			if (entry->have_prep_stmt && entry->have_error)
+			{
+				PGresult *res = PQexec(entry->conn, "DEALLOCATE ALL");
+				PQclear(res);
+			}
+
+			entry->have_prep_stmt = false;
+			entry->have_error = false;
+		}
+		else
+		{
+			/*
+			 * Don't try to clean up the connection if we're already in error
+			 * recursion trouble.
+			 */
+			if (in_error_recursion_trouble())
+				entry->changing_xact_state = true;
+
+			/* If connection is already unsalvageable, don't touch it further */
+
+			if (!entry->changing_xact_state)
+			{
+				entry->changing_xact_state = true;
+
+				if (entry->have_prep_stmt &&
+					!pgfdw_exec_cleanup_query(entry->conn, "DEALLOCATE ALL", true))
+				{
+					entry->changing_xact_state = true;
+				}
+			}
+		}
+		/* Reset state to show we're out of a transaction */
+		entry->xact_depth = 0;
+
+		/*
+		 * If the connection isn't in a good idle state, discard it to
+		 * recover. Next GetConnection will open a new connection.
+		 */
+		if (PQstatus(entry->conn) != CONNECTION_OK ||
+			PQtransactionStatus(entry->conn) != PQTRANS_IDLE)
+		{
+			elog(DEBUG3, "discarding connection %p", entry->conn);
+			PQfinish(entry->conn);
+			entry->conn = NULL;
+		}
+	}
+
+	/*
+	 * Regardless of whether we committed or aborted, we can now mark ourselves
+	 * as out of the transaction.
+	 */
+	xact_got_connection = false;
+
+	/* Also reset cursor numbering for next transaction */
+	cursor_number = 0;
+}
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index adbf77f..e3fc66c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -13,12 +13,17 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 -- ===================================================================
 -- create objects used through FDW loopback server
 -- ===================================================================
@@ -52,6 +57,13 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 INSERT INTO "S 1"."T 1"
 	SELECT id,
 	       id % 10,
@@ -82,6 +94,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 -- ===================================================================
 -- create foreign tables
 -- ===================================================================
@@ -124,6 +137,15 @@ CREATE FOREIGN TABLE ft6 (
 	c2 int NOT NULL,
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
+CREATE FOREIGN TABLE ft7_not_twophase (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft8_twophase (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+CREATE FOREIGN TABLE ft9_twophase (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -180,16 +202,19 @@ ALTER FOREIGN TABLE ft2 OPTIONS (schema_name 'S 1', table_name 'T 1');
 ALTER FOREIGN TABLE ft1 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 ALTER FOREIGN TABLE ft2 ALTER COLUMN c1 OPTIONS (column_name 'C 1');
 \det+
-                                      List of foreign tables
- Schema |   Table    |  Server   |                   FDW options                    | Description 
---------+------------+-----------+--------------------------------------------------+-------------
- public | ft1        | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
- public | ft2        | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
- public | ft4        | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
- public | ft5        | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
- public | ft6        | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
- public | ft_pg_type | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
-(6 rows)
+                                         List of foreign tables
+ Schema |      Table       |  Server   |                   FDW options                    | Description 
+--------+------------------+-----------+--------------------------------------------------+-------------
+ public | ft1              | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
+ public | ft2              | loopback  | (schema_name 'S 1', table_name 'T 1')            | 
+ public | ft4              | loopback  | (schema_name 'S 1', table_name 'T 3')            | 
+ public | ft5              | loopback  | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft6              | loopback2 | (schema_name 'S 1', table_name 'T 4')            | 
+ public | ft7_not_twophase | loopback  | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft8_twophase     | loopback2 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft9_twophase     | loopback3 | (schema_name 'S 1', table_name 'T 5')            | 
+ public | ft_pg_type       | loopback  | (schema_name 'pg_catalog', table_name 'pg_type') | 
+(9 rows)
 
 -- Test that alteration of server options causes reconnection
 -- Remote's errors might be non-English, so hide them to ensure stable results
@@ -7794,3 +7819,301 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
 (4 rows)
 
 RESET enable_partition_wise_join;
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+-- Check two_phase_commit setting
+SELECT srvname FROM pg_foreign_server WHERE 'two_phase_commit=on' = ANY(srvoptions) or 'two_phase_commit=off' = ANY(srvoptions);
+  srvname  
+-----------
+ loopback
+ loopback2
+ loopback3
+(3 rows)
+
+-- modify one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+(1 row)
+
+-- modify one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+(1 row)
+
+-- modify two supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(2);
+INSERT INTO ft9_twophase VALUES(2);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- modify two supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(3);
+INSERT INTO ft9_twophase VALUES(3);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+(3 rows)
+
+-- modify local and one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(4);
+INSERT INTO "S 1"."T 6" VALUES (4);
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  4
+(1 row)
+
+-- modify local and one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(5);
+INSERT INTO "S 1"."T 6" VALUES (5);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+(4 rows)
+
+SELECT * FROM "S 1"."T 6";
+ c1 
+----
+  4
+(1 row)
+
+-- modify supported server and non-supported server and commit.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(6);
+INSERT INTO ft8_twophase VALUES(6);
+COMMIT;
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- modify supported server and non-supported server and rollback.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(7);
+INSERT INTO ft8_twophase VALUES(7);
+ROLLBACK;
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- modify foreign server and raise an error
+BEGIN;
+INSERT INTO ft8_twophase VALUES(8);
+INSERT INTO ft9_twophase VALUES(NULL); -- violation
+ERROR:  null value in column "c1" violates not-null constraint
+DETAIL:  Failing row contains (null).
+CONTEXT:  Remote SQL command: INSERT INTO "S 1"."T 5"(c1) VALUES ($1)
+COMMIT;
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+(6 rows)
+
+-- commit and rollback foreign transactions that are part of
+-- prepare transaction.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(9);
+INSERT INTO ft9_twophase VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+BEGIN;
+INSERT INTO ft8_twophase VALUES(10);
+INSERT INTO ft9_twophase VALUES(10);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft9_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+-- fails, cannot prepare the transaction if non-supporeted
+-- server involved in.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(11);
+INSERT INTO ft8_twophase VALUES(11);
+PREPARE TRANSACTION 'gx1';
+ERROR:  can not prepare the transaction because some foreign servers involved in transaction can not prepare the transaction
+SELECT * FROM ft7_not_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
+SELECT * FROM ft8_twophase;
+ c1 
+----
+  1
+  2
+  2
+  4
+  6
+  6
+  9
+  9
+(8 rows)
+
diff --git a/contrib/postgres_fdw/option.c b/contrib/postgres_fdw/option.c
index 082f79a..dadd519 100644
--- a/contrib/postgres_fdw/option.c
+++ b/contrib/postgres_fdw/option.c
@@ -108,7 +108,8 @@ postgres_fdw_validator(PG_FUNCTION_ARGS)
 		 * Validate option value, when we can do so without any context.
 		 */
 		if (strcmp(def->defname, "use_remote_estimate") == 0 ||
-			strcmp(def->defname, "updatable") == 0)
+			strcmp(def->defname, "updatable") == 0 ||
+			strcmp(def->defname, "two_phase_commit") == 0)
 		{
 			/* these accept only boolean values */
 			(void) defGetBoolean(def);
@@ -177,6 +178,8 @@ InitPgFdwOptions(void)
 		/* fetch_size is available on both server and table */
 		{"fetch_size", ForeignServerRelationId, false},
 		{"fetch_size", ForeignTableRelationId, false},
+		/* two phase commit support */
+		{"two_phase_commit", ForeignServerRelationId, false},
 		{NULL, InvalidOid, false}
 	};
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c1d7f80..1f96e64 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -14,6 +14,8 @@
 
 #include "postgres_fdw.h"
 
+#include "access/fdwxact.h"
+#include "access/xact.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
 #include "catalog/pg_class.h"
@@ -353,6 +355,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+extern char*postgresGetPrepareId(Oid serveroid, Oid userid, int *prep_info_len);
 
 /*
  * Helper functions
@@ -434,7 +437,6 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
 				  const PgFdwRelationInfo *fpinfo_o,
 				  const PgFdwRelationInfo *fpinfo_i);
 
-
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
  * to my callback routines.
@@ -483,6 +485,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for foreign transactions */
+	routine->GetPrepareId = postgresGetPrepareId;
+	routine->PrepareForeignTransaction = postgresPrepareForeignTransaction;
+	routine->ResolvePreparedForeignTransaction = postgresResolvePreparedForeignTransaction;
+	routine->EndForeignTransaction = postgresEndForeignTransaction;
+
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
@@ -490,6 +498,38 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 }
 
 /*
+ * postgresGetPrepareId
+ *
+ * The function crafts prepared transaction identifier. PostgreSQL documentation
+ * mentions two restrictions on the name
+ * 1. String literal, less than 200 bytes long.
+ * 2. Should not be same as any other concurrent prepared transaction id.
+ *
+ * To make the prepared transaction id, we should ideally use something like
+ * UUID, which gives unique ids with high probability, but that may be expensive
+ * here and UUID extension which provides the function to generate UUID is
+ * not part of the core.
+ */
+extern char *
+postgresGetPrepareId(Oid serverid, Oid userid, int *prep_info_len)
+{
+	/* Maximum length of the prepared transaction id, borrowed from twophase.c */
+#define PREP_XACT_ID_MAX_LEN 200
+#define RANDOM_LARGE_MULTIPLIER 1000
+	char*prep_info;
+
+	/* Allocate the memory in the same context as the hash entry */
+	prep_info = (char *)palloc(PREP_XACT_ID_MAX_LEN * sizeof(char));
+	snprintf(prep_info, PREP_XACT_ID_MAX_LEN, "%s_%4ld_%d_%d",
+			 "px", Abs(random() * RANDOM_LARGE_MULTIPLIER),
+			 serverid, userid);
+
+	/* Account for the last NULL byte */
+	*prep_info_len = strlen(prep_info);
+	return prep_info;
+}
+
+/*
  * postgresGetForeignRelSize
  *		Estimate # of rows and width of the result of the scan
  *
@@ -1336,7 +1376,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->conn = GetConnection(user, false, true, false);
 
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
@@ -1685,6 +1725,7 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	Oid			userid;
 	ForeignTable *table;
 	UserMapping *user;
+	ForeignServer *server;
 	AttrNumber	n_params;
 	Oid			typefnoid;
 	bool		isvarlena;
@@ -1712,9 +1753,15 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	/* Get info about foreign table. */
 	table = GetForeignTable(RelationGetRelid(rel));
 	user = GetUserMapping(userid, table->serverid);
+	server = GetForeignServer(user->serverid);
+
+	/* Remember this foreign server has been modified */
+	FdwXactRegisterForeignServer(user->serverid, user->userid,
+								 server_uses_two_phase_commit(server),
+								 true);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->conn = GetConnection(user, true, true, false);
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2318,6 +2365,7 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	RangeTblEntry *rte;
 	Oid			userid;
 	ForeignTable *table;
+	ForeignServer *server;
 	UserMapping *user;
 	int			numParams;
 
@@ -2348,12 +2396,18 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 		dmstate->rel = node->ss.ss_currentRelation;
 	table = GetForeignTable(RelationGetRelid(dmstate->rel));
 	user = GetUserMapping(userid, table->serverid);
+	server = GetForeignServer(user->serverid);
+
+	/* Remember this foreign server has been modified */
+	FdwXactRegisterForeignServer(user->serverid, user->userid,
+								 server_uses_two_phase_commit(server),
+								 true);
 
 	/*
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->conn = GetConnection(user, false, true, false);
 
 	/* Update the foreign-join-related fields. */
 	if (fsplan->scan.scanrelid == 0)
@@ -2650,7 +2704,7 @@ estimate_path_cost_size(PlannerInfo *root,
 								&retrieved_attrs, NULL);
 
 		/* Get the remote estimate */
-		conn = GetConnection(fpinfo->user, false);
+		conn = GetConnection(fpinfo->user, false, true, false);
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3910,7 +3964,7 @@ postgresAnalyzeForeignTable(Relation relation,
 	 */
 	table = GetForeignTable(RelationGetRelid(relation));
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct command to get page count for relation.
@@ -4000,7 +4054,7 @@ postgresAcquireSampleRowsFunc(Relation relation, int elevel,
 	table = GetForeignTable(RelationGetRelid(relation));
 	server = GetForeignServer(table->serverid);
 	user = GetUserMapping(relation->rd_rel->relowner, table->serverid);
-	conn = GetConnection(user, false);
+	conn = GetConnection(user, false, true, false);
 
 	/*
 	 * Construct cursor that retrieves whole rows from remote.
@@ -4223,7 +4277,7 @@ postgresImportForeignSchema(ImportForeignSchemaStmt *stmt, Oid serverOid)
 	 */
 	server = GetForeignServer(serverOid);
 	mapping = GetUserMapping(GetUserId(), server->serverid);
-	conn = GetConnection(mapping, false);
+	conn = GetConnection(mapping, false, true, false);
 
 	/* Don't attempt to import collation if remote server hasn't got it */
 	if (PQserverVersion(conn) < 90100)
@@ -5590,3 +5644,26 @@ find_em_expr_for_rel(EquivalenceClass *ec, RelOptInfo *rel)
 	/* We didn't find any suitable equivalence class expression */
 	return NULL;
 }
+
+/*
+ * server_uses_two_phase_commit
+ * Returns true if the foreign server is configured to support 2PC.
+ */
+bool
+server_uses_two_phase_commit(ForeignServer *server)
+{
+	ListCell		*lc;
+
+	/* Check the options for two phase compliance */
+	foreach(lc, server->options)
+	{
+		DefElem    *d = (DefElem *) lfirst(lc);
+
+		if (strcmp(d->defname, "two_phase_commit") == 0)
+		{
+			return defGetBoolean(d);
+		}
+	}
+	/* By default a server is not 2PC compliant */
+	return false;
+}
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index d37cc88..b21a2fb 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -13,6 +13,7 @@
 #ifndef POSTGRES_FDW_H
 #define POSTGRES_FDW_H
 
+#include "access/fdwxact.h"
 #include "foreign/foreign.h"
 #include "lib/stringinfo.h"
 #include "nodes/relation.h"
@@ -115,7 +116,9 @@ extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
-extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt,
+							 bool start_transaction, bool connection_error_ok);
+extern PGconn *GetExistingConnection(Oid umid);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
@@ -123,6 +126,14 @@ extern PGresult *pgfdw_get_result(PGconn *conn, const char *query);
 extern PGresult *pgfdw_exec_query(PGconn *conn, const char *query);
 extern void pgfdw_report_error(int elevel, PGresult *res, PGconn *conn,
 				   bool clear, const char *sql);
+extern bool postgresPrepareForeignTransaction(Oid serverid, Oid userid,
+											  Oid umid, const char *prep_id);
+extern bool postgresEndForeignTransaction(Oid serverid, Oid userid,
+										  Oid umid, bool is_commit);
+extern bool postgresResolvePreparedForeignTransaction(Oid serverid, Oid userid,
+													  Oid umid, bool is_commit,
+													  const char *prep_id);
+
 
 /* in option.c */
 extern int ExtractConnectionOptions(List *defelems,
@@ -179,6 +190,7 @@ extern void deparseSelectStmtForRel(StringInfo buf, PlannerInfo *root,
 						List *remote_conds, List *pathkeys, bool is_subquery,
 						List **retrieved_attrs, List **params_list);
 extern const char *get_jointype_name(JoinType jointype);
+extern bool server_uses_two_phase_commit(ForeignServer *server);
 
 /* in shippable.c */
 extern bool is_builtin(Oid objectId);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 0b2c528..7f74356 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -15,6 +15,10 @@ DO $d$
             OPTIONS (dbname '$$||current_database()||$$',
                      port '$$||current_setting('port')||$$'
             )$$;
+        EXECUTE $$CREATE SERVER loopback3 FOREIGN DATA WRAPPER postgres_fdw
+            OPTIONS (dbname '$$||current_database()||$$',
+                     port '$$||current_setting('port')||$$'
+            )$$;
     END;
 $d$;
 
@@ -22,6 +26,7 @@ CREATE USER MAPPING FOR public SERVER testserver1
 	OPTIONS (user 'value', password 'value');
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback;
 CREATE USER MAPPING FOR CURRENT_USER SERVER loopback2;
+CREATE USER MAPPING FOR CURRENT_USER SERVER loopback3;
 
 -- ===================================================================
 -- create objects used through FDW loopback server
@@ -56,6 +61,14 @@ CREATE TABLE "S 1"."T 4" (
 	c3 text,
 	CONSTRAINT t4_pkey PRIMARY KEY (c1)
 );
+CREATE TABLE "S 1"."T 5" (
+       c1 int NOT NULL
+);
+
+CREATE TABLE "S 1"."T 6" (
+       c1 int NOT NULL,
+       CONSTRAINT t6_pkey PRIMARY KEY (c1)
+);
 
 INSERT INTO "S 1"."T 1"
 	SELECT id,
@@ -88,6 +101,7 @@ ANALYZE "S 1"."T 1";
 ANALYZE "S 1"."T 2";
 ANALYZE "S 1"."T 3";
 ANALYZE "S 1"."T 4";
+ANALYZE "S 1"."T 5";
 
 -- ===================================================================
 -- create foreign tables
@@ -136,6 +150,19 @@ CREATE FOREIGN TABLE ft6 (
 	c3 text
 ) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 4');
 
+CREATE FOREIGN TABLE ft7_not_twophase (
+       c1 int NOT NULL
+) SERVER loopback OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft8_twophase (
+       c1 int NOT NULL
+) SERVER loopback2 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+CREATE FOREIGN TABLE ft9_twophase (
+       c1 int NOT NULL
+) SERVER loopback3 OPTIONS (schema_name 'S 1', table_name 'T 5');
+
+
 -- A table with oids. CREATE FOREIGN TABLE doesn't support the
 -- WITH OIDS option, but ALTER does.
 CREATE FOREIGN TABLE ft_pg_type (
@@ -1905,3 +1932,109 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
 
 RESET enable_partition_wise_join;
+
+-- ===================================================================
+-- test Atomic commit across foreign servers
+-- ===================================================================
+
+ALTER SERVER loopback OPTIONS(ADD two_phase_commit 'off');
+ALTER SERVER loopback2 OPTIONS(ADD two_phase_commit 'on');
+ALTER SERVER loopback3 OPTIONS(ADD two_phase_commit 'on');
+
+-- Check two_phase_commit setting
+SELECT srvname FROM pg_foreign_server WHERE 'two_phase_commit=on' = ANY(srvoptions) or 'two_phase_commit=off' = ANY(srvoptions);
+
+-- modify one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+COMMIT;
+SELECT * FROM ft8_twophase;
+
+-- modify one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(1);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+
+-- modify two supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(2);
+INSERT INTO ft9_twophase VALUES(2);
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- modify two supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(3);
+INSERT INTO ft9_twophase VALUES(3);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- modify local and one supported server and commit.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(4);
+INSERT INTO "S 1"."T 6" VALUES (4);
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM "S 1"."T 6";
+
+-- modify local and one supported server and rollback.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(5);
+INSERT INTO "S 1"."T 6" VALUES (5);
+ROLLBACK;
+SELECT * FROM ft8_twophase;
+SELECT * FROM "S 1"."T 6";
+
+-- modify supported server and non-supported server and commit.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(6);
+INSERT INTO ft8_twophase VALUES(6);
+COMMIT;
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
+
+-- modify supported server and non-supported server and rollback.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(7);
+INSERT INTO ft8_twophase VALUES(7);
+ROLLBACK;
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
+
+-- modify foreign server and raise an error
+BEGIN;
+INSERT INTO ft8_twophase VALUES(8);
+INSERT INTO ft9_twophase VALUES(NULL); -- violation
+COMMIT;
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- commit and rollback foreign transactions that are part of
+-- prepare transaction.
+BEGIN;
+INSERT INTO ft8_twophase VALUES(9);
+INSERT INTO ft9_twophase VALUES(9);
+PREPARE TRANSACTION 'gx1';
+COMMIT PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+BEGIN;
+INSERT INTO ft8_twophase VALUES(10);
+INSERT INTO ft9_twophase VALUES(10);
+PREPARE TRANSACTION 'gx1';
+ROLLBACK PREPARED 'gx1';
+SELECT * FROM ft8_twophase;
+SELECT * FROM ft9_twophase;
+
+-- fails, cannot prepare the transaction if non-supporeted
+-- server involved in.
+BEGIN;
+INSERT INTO ft7_not_twophase VALUES(11);
+INSERT INTO ft8_twophase VALUES(11);
+PREPARE TRANSACTION 'gx1';
+SELECT * FROM ft7_not_twophase;
+SELECT * FROM ft8_twophase;
diff --git a/doc/src/sgml/postgres-fdw.sgml b/doc/src/sgml/postgres-fdw.sgml
index 54b5e98..f065b7b 100644
--- a/doc/src/sgml/postgres-fdw.sgml
+++ b/doc/src/sgml/postgres-fdw.sgml
@@ -436,6 +436,43 @@
    </para>
 
   </sect3>
+
+  <sect3>
+   <title>Transaction Management Options</title>
+
+   <para>
+    By default, if a transaction involves multiple remote servers,
+    the transaction on each remote server is committed or aborted independently.
+    Some of these transactions may fail to commit on their remote server while
+    others commit successfully. This may be overridden using the
+    following option:
+   </para>
+
+   <variablelist>
+
+    <varlistentry>
+     <term><literal>two_phase_commit</literal></term>
+     <listitem>
+      <para>
+       This option controls whether <filename>postgres_fdw</filename> is allowed
+       to use two-phase commit when a transaction commits. This option can
+       only be specified for foreign servers, not per-table.
+       The default is <literal>false</literal>.
+      </para>
+
+      <para>
+       If this option is enabled, <filename>postgres_fdw</filename> prepares
+       the transaction on the remote server and <productname>PostgreSQL</productname>
+       keeps track of the distributed transaction.
+       <xref linkend="guc-max-prepared-foreign-transactions"/> must be set to more
+       than 1 on the local server and <xref linkend="guc-max-prepared-transactions"/>
+       must be set to more than 1 on the remote server.
+      </para>
+     </listitem>
+    </varlistentry>
+
+   </variablelist>
+  </sect3>
  </sect2>
 
  <sect2>
-- 
1.7.1

#179

Robert Haas

robertmhaas@gmail.com

almost 8 years ago

In reply to: Masahiko Sawada (#178)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Thu, Feb 8, 2018 at 3:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Overall, what's the status of this patch? Are we hung up on this
issue only, or are there other things?

AFAIK there is no more technical issue in this patch so far other than
this issue. The patch has tests and docs, and includes all stuff to
support atomic commit to distributed transactions: the introducing
both the atomic commit ability to distributed transactions and some
corresponding FDW APIs, and having postgres_fdw support 2pc. I think
this patch needs to be reviewed, especially the functionality of
foreign transaction resolution which is re-designed before.

OK. I'm going to give 0002 a read-through now, but I think it would
be a good thing if others also contributed to the review effort.
There is a lot of code here, and there are a lot of other patches
competing for attention. That said, here we go:

In the documentation for pg_prepared_fdw_xacts, the first two columns
have descriptions ending in a preposition. That's typically to be
avoided in formal writing. The first one can be fixed by moving "in"
before "which". The second can be fixed by changing "that" to "with
which" and then dropping the trailing "with". The first three columns
have descriptions ending in a period; the latter two do not. Make it
consistent with whatever the surrounding style is, or at least
internally consistent if the surrounding style varies. Also, some of
the descriptions begin with "the" and others do not; again, seems
better to be consistent and adhere to surrounding style. The
documentation of the serverid column seems to make no sense. Possibly
you mean "OID of the foreign server on which this foreign transaction
is prepared"? As it is, you use "foreign server" twice, which is why
I'm confused.

The documentation of max_prepared_foreign_transactions seems a bit
brief. I think that if I want to be able to use 2PC for N
transactions each across K servers, this variable needs to be set to
N*K, not just N. That's not the right way to word it here, but
somehow you probably want to explain that a single local transaction
can give rise to multiple foreign transactions and that this should be
taken into consideration when setting a value for this variable.
Maybe also include a link to where the user can find more information,
which brings me to another point: there doesn't seem to be any
general, user-facing explanation of this system. You explain the
catalogs, the GUCs, the interface, etc. but there's nothing anywhere
that explains the overall point of the whole thing, which seems pretty
important. The closest thing you've got is a description for people
writing FDWs, but we really need a place to explain the concept to
*users*. One idea is to add a new chapter to the "Server
Administration" section, maybe just after "Logical Replication" and
before "Regression Tests". But I'm open to other ideas.
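
A minimal sketch of the sizing arithmetic above, assuming up to N local transactions are concurrently in flight and each prepares on up to K foreign servers. The helper name is hypothetical and not part of the patch; it only illustrates why the setting scales with N*K rather than N.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch: lower bound for
 * max_prepared_foreign_transactions when up to n_local_xacts local
 * transactions may each hold a prepared transaction on up to
 * servers_per_xact foreign servers at the same time.
 */
static int
required_foreign_xact_slots(int n_local_xacts, int servers_per_xact)
{
	assert(n_local_xacts >= 0 && servers_per_xact >= 0);

	/* One slot per (local transaction, foreign server) pair. */
	return n_local_xacts * servers_per_xact;
}
```

For example, 10 concurrent local transactions that each write to 3 foreign servers would need at least 30 slots under this assumption.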

It's important that the documentation of the various GUCs provide
users with some clue as to how to set them. I notice this
particularly for foreign_transaction_resolution_interval; off the top
of my head, without having read the rest of this patch, I don't know
why I shouldn't want this to be zero. But the others could use more
explanation as well.

It is unclear from reading the documentation for GetPreparedId why
this should be the responsibility of the FDW, or how the FDW is
supposed to guarantee uniqueness.

PrepareForignTransaction is spelled wrong. Nearby typos: prepareing,
tranasaction. A bit further down, "can _prepare" has an unwanted
space in the middle. Various places in this section capitalize
certain words in the middle of sentences which is not standard English
style. For example, in "Every Foreign Data Wrapper is required..."
the word "Every" is appropriately capitalized because it begins a
sentence, but there's no reason to capitalize the others. Likewise
for "...employs Two-phase commit protocol..." and other similar cases.

EndForeignTransaction doesn't explain what sorts of things the FDW can
legally do when called, or how this method is intended to be used.
Those seem like important details. Especially, an FDW that tries to
do arbitrary stuff during abort cleanup will probably cause big
problems.

The fdw-transactions section of the documentation seems to imply that
henceforth every FDW must call FdwXactRegisterForeignServer, which I
think is an unacceptable API break.

It doesn't seem advisable to make this behavior depend on
max_prepared_foreign_transactions. I think that it should be a
server-level option: use 2PC for this server, or not? FDWs that don't
have 2PC default to "no"; but others can be set to "yes" if the user
wishes. But we shouldn't force that decision to be on a cluster-wide
basis.

+    <xref linkend="functions-fdw-transaction-table"/> shows the functions
+    available for foreign transaction managements.

management

+ Resolve a foreign transaction. This function search foreign transaction

searches for

+ matching the criteria and resolve then. This function doesn't resolve

criteria->arguments, resolve->resolves, doesn't->won't

+ an entry of which transaction is in-progress, or that is locked by some

a foreign transaction which is in progress, or one that is locked by some

This doesn't seem like a great API contract. It would be nice to have
the guarantee that, if the function returns without error, all
transactions that were prepared before this function was run and which
match the given arguments are now resolved. Skipping locked
transactions removes that guarantee.

+        This function works the same as <function>pg_resolve_fdw_xact</function>
+        except it remove foreign transaction entry without resolving.

Explain why that's useful.

+ <entry>OID of the database that the foreign transaction resolver process connects to</entry>

to which the ... is connected

+ <entry>Time of last resolved a foreign transaction</entry>

Time at which the process last resolved a foreign transaction

+ of foreign trasactions.

Spelling error.

The new wait events aren't documented.

+ * This module manages the transactions involving foreign servers.

Remove this. Doesn't add any information.

+ * This comment summarises how the transaction manager handles transactions
+ * involving one or more foreign servers.

This too.

+ * connection is identified by oid fo foreign server and user.

fo -> of

+ * first phase doesn not succeed for whatever reason, the foreign servers

doesn -> does

But more generally:

+ * The commit is executed in two phases. In the first phase executed during
+ * pre-commit phase, transactions are prepared on all the foreign servers,
+ * which can participate in two-phase commit protocol. Transaction on other
+ * foreign servers are committed in the same phase. In the second phase, if
+ * first phase doesn not succeed for whatever reason, the foreign servers
+ * are asked to rollback respective prepared transactions or abort the
+ * transactions if they are not prepared. This process is executed by backend
+ * process that executed the first phase. If the first phase succeeds, the
+ * backend process registers ourselves to the queue in the shared memory and then
+ * ask the foreign transaction resolver process to resolve foreign transactions
+ * that are associated with the its transaction. After resolved all foreign
+ * transactions by foreign transaction resolve process the backend wakes up
+ * and resume to process.

The only way this can be reliable, I think, is if we prepare all of
the remote transactions before committing locally and commit them
after committing locally. Otherwise, if we crash or fail before
committing locally, our atomic commit is no longer atomic. I think
the way this should work is: during pre-commit, we prepare the
transaction everywhere. After committing or rolling back, we notify
the resolver process and tell it to try to commit or roll back those
transactions. If we ask it to commit, we also tell it to notify us
when it's done, so that we can wait (interruptibly) for it to finish,
and so that we're not trying to locally do work that might fail with
an ERROR after having already committed (that would confuse the client). If
the user hits ^C, then we handle it like synchronous replication: we
emit a WARNING saying that the transaction might not be remotely
committed yet, and return success. I see you've got that logic in
FdwXactWaitToBeResolved() so maybe this comment isn't completely in
sync with the latest version of the patch, but I think there are some
remaining ordering problems as well; see below.
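
The ordering proposed above can be modelled as a small trace. This is only a sketch: every name below is a hypothetical stand-in rather than a function from the patch, and the remote PREPARE results are passed in as an array instead of being obtained from real servers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* All names here are hypothetical stand-ins, not functions from the patch. */

#define MAX_STEPS 16

typedef struct CommitTrace
{
	const char *steps[MAX_STEPS];
	int			nsteps;
} CommitTrace;

static void
trace_step(CommitTrace *trace, const char *step)
{
	assert(trace->nsteps < MAX_STEPS);
	trace->steps[trace->nsteps++] = step;
}

/*
 * Model of the proposed ordering.  prepare_ok[i] says whether PREPARE
 * succeeds on foreign server i.  Returns true if the transaction commits.
 */
static bool
commit_distributed(CommitTrace *trace, const bool *prepare_ok, int nservers)
{
	bool		all_prepared = true;
	int			i;

	/* Pre-commit: prepare on every foreign server before the local commit. */
	for (i = 0; i < nservers; i++)
	{
		trace_step(trace, "prepare remote");
		if (!prepare_ok[i])
		{
			all_prepared = false;
			break;
		}
	}

	/* The local commit or abort is the decision point. */
	trace_step(trace, all_prepared ? "commit local" : "abort local");

	/* Only afterwards is the resolver asked to finish the remote side. */
	trace_step(trace, all_prepared ? "resolver: commit prepared"
			   : "resolver: rollback prepared");

	/* On commit, the backend waits (interruptibly) for the resolver. */
	if (all_prepared)
		trace_step(trace, "wait for resolver");

	return all_prepared;
}
```

The point of the model is that no remote transaction is committed before the local commit record exists, so a crash before that point leaves only prepared transactions that can still be rolled back.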

I think it is, generally, confusing to describe this process as having
two phases. For one thing, two-phase commit has two phases, and
they're not these two phases, but we're talking about them in a patch
about two-phase commit. But also, they really aren't two phases.
Saying something has two phases means that A happens and then B
happens. But, at least as this is presently described here, B might
not need to happen at all. So that's not really a phase. I think you
need a different word here, like maybe "ways", unless I'm just
misunderstanding what this whole thing is saying.

+ *    * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions()
+ *      have been modified to go through fdw_xact->inredo entries that have
+ *      not made to disk yet.

This doesn't seem to be true. I see no reference to these functions
being modified elsewhere in the patch. Nor is it clear why they would
need to be modified. For local 2PC, prepared transactions need to be
included in all snapshots that are taken. Otherwise, if a local 2PC
transaction were committed, concurrent transactions would see the
effects of that transaction appear all at once, even though they
hadn't gotten a new snapshot. That is the reason why we need
StandbyRecoverPreparedTransactions() before opening for hot standby.
But for FDW 2PC, even if we knew which foreign transactions were
prepared but not yet committed, we have no mechanism for preventing
those changes from being visible on the remote servers, nor do they
have any effect on local visibility. So there's no need for this
AFAICS. Similarly, this appears to me to be incorrect:

+        RecoveryRequiresIntParameter("max_prepared_foreign_transactions",
+                                     max_prepared_foreign_xacts,
+                                     ControlFile->max_prepared_foreign_xacts);

I might be confused here, but it seems to me that the value of
max_prepared_foreign_transactions is irrelevant to whether we can
initialize Hot Standby, because, again, those remote xacts have no
effect on our local snapshot. Rather, the problem is that we are
*altogether unable to proceed with recovery* if this value is too low,
regardless of whether we are doing Hot Standby or not. twophase.c
handles that by just erroring out in PrepareRedoAdd() if we run out of
slots, and insert_fdw_xact does the same thing (although that message
doesn't follow style guidelines -- no space before a colon, please!).
So it seems to me that you can just delete this code and the
associated documentation mention; these concerns are irrelevant here
and the actual failure case is otherwise handled.

+ *      We save them to disk and alos set fdw_xact->ondisk to true.
+ *    * RecoverPreparedTrasactions() and StandbyRecoverPreparedTransactions()
+                     errmsg("prepread foreign transactions are disabled"),
+                 errmsg("out of foreign trasanction resolver slots"),

More typos: alos, RecoverPreparedTrasactions, prepread, trasanction

+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#include "postgres.h"

Thou shalt not have any #includes before "postgres.h"

+#include "miscadmin.h"
+#include "funcapi.h"

These should be in the main alphabetized list. If that doesn't work,
then some header is broken.

+        if (fdw_conn->serverid == serverid && fdw_conn->userid == userid)
+        {
+            fdw_conn->modified |= modify;

I suggest avoiding |= with Boolean. It might be harmless, but it
would be just as easy to write this as if (existing conditions &&
!fdw_conn->modified) fdw_conn->modified = true, which avoids any
assumption about the bit patterns.

+        max_hash_size = max_prepared_foreign_xacts;
+        init_hash_size = max_hash_size / 2;

I think we decided that hash tables other than the lock manager should
initialize with maximum size = initial size. See
7c797e7194d969f974abf579cacf30ffdccdbb95.

+ if (list_length(MyFdwConnections) < 1)

How about if (MyFdwConnections == NIL)?

This occurs multiple times in the patch.

+ if (list_length(MyFdwConnections) == 0)

How about if (MyFdwConnections == NIL)?

+    if ((list_length(MyFdwConnections) > 1) ||
+        (list_length(MyFdwConnections) == 1 && (MyXactFlags & XACT_FLAGS_WROTENONTEMPREL)))
+        return true;

I think this would be clearer written this way:

int nserverswritten = list_length(MyFdwConnections);
if ((MyXactFlags & XACT_FLAGS_WROTENONTEMPREL) != 0)
++nserverswritten;
return nserverswritten > 1;
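
A self-contained sketch of that counting logic, with the mask test parenthesized because != binds more tightly than & in C. The names needs_two_phase and SKETCH_WROTE_NONTEMP_REL are made up for illustration, and the bit value is arbitrary rather than the one PostgreSQL uses.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit only; not the real value of XACT_FLAGS_WROTENONTEMPREL. */
#define SKETCH_WROTE_NONTEMP_REL  (1U << 2)

/*
 * Count the servers written to: each foreign connection counts once, and
 * the local server counts once if it wrote to a non-temporary relation.
 * Two-phase commit is only worthwhile when more than one server wrote.
 */
static bool
needs_two_phase(int nforeign, unsigned int xact_flags)
{
    int nserverswritten = nforeign;

    assert(nforeign >= 0);

    /* Without the inner parentheses this would parse as flags & (MASK != 0). */
    if ((xact_flags & SKETCH_WROTE_NONTEMP_REL) != 0)
        ++nserverswritten;

    return nserverswritten > 1;
}
```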

But that brings up another issue: why is MyFdwConnections named that
way and why does it have those semantics? That is, why do we really
need a list of every FDW connection? I think we really only need the
ones that are 2PC-capable writes. If a non-2PC-capable foreign server
is involved in the transaction, then we don't really need to keep it in a
list. We just need to flag the transaction as not locally prepareable
i.e. clear TwoPhaseReady. I think that this could all be figured out
in a much less onerous way: if we ever perform a modification of a
foreign table, have nodeModifyTable.c either mark the transaction
non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
server is not 2PC capable, or otherwise add the appropriate
information to MyFdwConnections, which can then be renamed to indicate
that it contains only information about preparable stuff. Then you
don't need each individual FDW to be responsible about calling some
new function; the core code just does the right thing automatically.

+        if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+                                        fdw_conn->umid, true))
+            elog(WARNING, "could not commit transaction on server %s",
+                 fdw_conn->servername);

First, as I noted above, this assumes that the local transaction can't
fail after this point, which is certainly false -- if nothing else,
think about a plug pull. Second, it assumes that the right thing to
do would be to throw a WARNING, which I think is false: if this is
running in the pre-commit phase, it's not too late to switch to the
abort path, and certainly if we make changes on only 1 server and the
commit on that server fails, we should be rolling back, not committing
with a warning. Third, if we did need to restrict ourselves to
warnings here, that's probably impractical. This function needs to do
network I/O, which is not a no-fail operation, and even if it works,
the remote side can fail in any arbitrary way.

+FdwXactRegisterFdwXact(Oid dbid, TransactionId xid, Oid serverid, Oid userid,

This is not a very clear function name. Apparently, the thing that is
doing the registration is an FdwXact, and the thing being registered
is also an FdwXact.

+        /*
+         * Between FdwXactRegisterFdwXact call above till this backend hears back
+         * from foreign server, the backend may abort the local transaction
+         * (say, because of a signal). During abort processing, it will send
+         * an ABORT message to the foreign server. If the foreign server has
+         * not prepared the transaction, the message will succeed. If the
+         * foreign server has prepared transaction, it will throw an error,
+         * which we will ignore and the prepared foreign transaction will be
+         * resolved by a foreign transaction resolver.
+         */
+        if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+                                            fdw_conn->umid, fdw_xact_id))
+        {

Again, I think this is an impractical API contract. It assumes that
prepare_foreign_xact can't throw errors, which is likely to make for a
lot of implementation problems on the FDW side -- you can't even
palloc. You can't call any existing code you've already got that
might throw an error for any reason. You definitely can't do a
syscache lookup. You can't accept interrupts, even though you're
doing network I/O that could hang. The only reason to structure it
this way is to avoid having a transaction that we think is prepared in
our local bookkeeping when, on the remote side, it really isn't. But
that is an unavoidable problem, because the whole system could crash
after the remote prepare has been done and before prepare_foreign_xact
even returns control. Or we could fail after XLogInsert() and before
XLogFlush().

AtEOXact_FdwXacts has the same problem.

I think you can make this a lot cleaner if you make it part of the
API contract that resolver shouldn't fail when asked to roll back a
transaction that was never prepared. Then it can work like this: just
before we prepare the remote transaction, we note the information in
shared memory and in the WAL. If that fails, we just abort. When the
resolver sees that our xact is no longer running and did not commit,
it concludes that we must have failed and tries to roll back all the
remotely-prepared xacts. If they don't exist, then the PREPARE
failed; if they do, then we roll 'em back. On the other hand, if all
of the prepares succeed, then we commit. Seeing that our XID is no
longer running and did commit, the resolver tries to commit those
remotely-prepared xacts. In the commit case, we log a complaint if we
don't find the xact, but in the abort case, it's got to be an expected
outcome. If we really wanted to get paranoid about this, we could log
a WAL record after finishing all the prepares -- or even each prepare
-- saying "hey, after this point the remote xact should definitely
exist!". And then the resolver could complain about a nonexistent remote
xact when rolling back only if that record hasn't been issued yet.
But to me that looks like overkill; the window between issuing that
record and issuing the actual commit record would be very narrow, and
in most cases they would be flushed together anyway. We could of
course force the new record to be separately flushed first, but that's
just making everything slower for no gain.
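
The resolver rules described above can be sketched as a small decision function. This is only an illustration under simplifying assumptions: the enum and struct names are invented here, and the real resolver would consult clog and the remote server rather than take two arguments.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { LOCAL_IN_PROGRESS, LOCAL_COMMITTED, LOCAL_ABORTED } LocalXactState;

typedef enum
{
    RESOLVE_WAIT,               /* local xact still running: do nothing yet */
    RESOLVE_COMMIT_PREPARED,
    RESOLVE_ROLLBACK_PREPARED,
    RESOLVE_FORGET              /* nothing to resolve remotely */
} ResolveAction;

typedef struct ResolveDecision
{
    ResolveAction action;
    bool          complain;     /* log a complaint about a missing xact? */
} ResolveDecision;

static ResolveDecision
decide_resolution(LocalXactState local, bool remote_prepared_exists)
{
    ResolveDecision d = {RESOLVE_WAIT, false};

    switch (local)
    {
        case LOCAL_IN_PROGRESS:
            break;
        case LOCAL_COMMITTED:
            if (remote_prepared_exists)
                d.action = RESOLVE_COMMIT_PREPARED;
            else
            {
                /* All prepares succeeded before commit, so this is odd. */
                d.action = RESOLVE_FORGET;
                d.complain = true;
            }
            break;
        case LOCAL_ABORTED:
            /* A missing remote xact just means the PREPARE never happened. */
            d.action = remote_prepared_exists ?
                RESOLVE_ROLLBACK_PREPARED : RESOLVE_FORGET;
            break;
    }
    return d;
}
```

The asymmetry is the point: a missing remote transaction is an expected outcome after a local abort, and worth a complaint only after a local commit.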

Note that this requires that we not truncate away portions of clog
that contain commit status information about no-longer-running
transactions that have unresolved FDW 2PC xacts, or at least that we
issue a WAL record updating the state of the fdw_xact so that it
doesn't refer to that portion of clog any more - e.g. by setting the
XID to either FrozenTransactionId or InvalidTransactionId, though that
would be a problem since it would break the unique-XID assumption. I
don't see the patch doing either of those things right now, although I
may have missed it. Note that here again the differences between
local 2PC and FDW 2PC really make a difference. Local 2PC has a
PGPROC+PGXACT, so the regular snapshot-taking code suffices to prevent
clog truncation, because the relevant XIDs are showing up in
snapshots. The PrescanPreparedTransactions stuff only needs to nail
things down until we reach consistency, and then the regular
mechanisms take over. We have no such easy out for this system.

+    if (unlink(path))
+        if (errno != ENOENT || giveWarning)

Poor style. Use &&, not nested ifs. Oh, I guess you copied this from
the twophase.c code; but let's fix it anyway.
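
For illustration, the single-condition form might look like the sketch below. It is standalone C rather than backend code: remove_state_file is a hypothetical name, and fprintf stands in for the ereport(WARNING, ...) the backend would use.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Remove a state file.  A missing file is tolerated silently unless
 * give_warning is set.  Returns false whenever a warning was emitted.
 */
static bool
remove_state_file(const char *path, bool give_warning)
{
    assert(path != NULL);

    /* One combined condition instead of an if nested inside an if. */
    if (unlink(path) != 0 && (errno != ENOENT || give_warning))
    {
        fprintf(stderr, "WARNING: could not remove file \"%s\": %s\n",
                path, strerror(errno));
        return false;
    }
    return true;
}
```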

It's not exactly clear to me what the point of "locked" FdwXacts is,
but whatever that point may be, how can remove_fdw_xact() get away
with summarily releasing a lock that may be held by some other
process? If we're the process that has the FdwXact locked, then we'll
delete it from MyLockedFdwXacts, but if some other process has it
locked, nothing will happen. If that's safe for some non-obvious
reason, it at least needs a comment.

I think this whole function could be written with a lot less nesting
if you first write a loop to find the appropriate value for cnt, then
error out if we end up with cnt >= FdwXactCtl->numFdwXacts, and then
finally do all of the stuff that happens once we identify a match.
That saves two levels of indentation for most of the function.

The delayCkpt interlocking which twophase.c uses is absent here.
That's important, because otherwise a checkpoint can happen between
the time we write to WAL and the time we actually perform the on-disk
operations. If a crash happens just before the operation is actually
performed, then it never happens on the master but still happens on
the standbys. Oops.

+void
+FdwXactRedoRemove(TransactionId xid, Oid serverid, Oid userid)
+{
+    FdwXact    fdw_xact;
+
+    Assert(RecoveryInProgress());
+
+    fdw_xact = get_fdw_xact(xid, serverid, userid);
+
+    if (fdw_xact)
+    {
+        /* Now we can clean up any files we already left */
+        Assert(fdw_xact->inredo);
+        remove_fdw_xact(fdw_xact);
+    }
+    else
+    {
+        /*
+         * Entry could be on disk. Call with giveWarning = false
+         * since it can be expected during replay.
+         */
+        RemoveFdwXactFile(xid, serverid, userid, false);
+    }
+}

I hope this won't sound too harsh, but the phrase that comes to mind
here is "spaghetti code". First, we look for a matching FdwXact in
shared memory and, if we find one, do all of the cleanup inside
remove_fdw_xact() which also removes it from the disk. Otherwise, we
try to remove it from disk anyway. if (condition) do_a_and_b(); else
do_b(); is not generally a good way to structure code. Moreover, it's
not clear why we should be doing it like this in the first place.
There's no similar logic in twophase.c; PrepareRedoRemove does nothing
on disk if the state isn't found in memory. The fact that you've got
this set up this way suggests that you haven't structured things so as
to guarantee that the in-memory state is always accurate. If so, that
should be fixed; if not, this code isn't needed anyway.

+    if (!fdwXactExitRegistered)
+    {
+        before_shmem_exit(AtProcExit_FdwXact, 0);
+        fdwXactExitRegistered = true;
+    }

Sadly, this code has a latent hazard. If somebody ever calls this
from inside a PG_ENSURE_ERROR_CLEANUP() block, then they can end up
failing to unregister their handler, because of the limitations
described in before_shmem_cleanup()'s handler. It's better to do this
in FdwXactShmemInit().

+ if (list_length(entries_to_resolve) == 0)

Here again, just test against NIL.

fdwxact_resolver.c is very light on comments.

+static
+void FdwXactRslvLoop(void)

Not project style. There are other, similar instances.

+ need_twophase = TwoPhaseCommitRequired();

I think this nomenclature is going to cause confusion. We need to
distinguish somehow between using remote 2PC for foreign transactions
and using local 2PC. The TwoPhase naming is already well-established
as referring to the latter, so I think we should name this some other
way.

+    if (fdw_xact_exists(InvalidTransactionId, MyDatabaseId, srvId, InvalidOid))
+    {
+        Form_pg_foreign_server srvForm = (Form_pg_foreign_server) GETSTRUCT(tp);
+        ereport(ERROR,
+                (errmsg("server \"%s\" has unresolved prepared transactions on it",
+                        NameStr(srvForm->srvname))));
+    }

I think if this happens, it would be more appropriate to just issue a
WARNING and forget about those transactions. Blocking DROP is not
nice, and shouldn't be done without a really good reason.

+ (errmsg("preparing foreign transactions (max_prepared_foreign_transactions > 0) requires maX_foreign_xact_resolvers > 0")));

Bogus capitalization.

+#define FDW_XACT_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

I am very much not impressed by this uncommented macro definition.
You can probably guess the reason. :-)
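
As an illustration of the kind of comment that would help, here is a sketch that spells out what each term could stand for. The layout is an assumption made up for this example (a two-character prefix followed by three 8-digit hex fields); the thread does not say what the patch's identifier actually contains, and SKETCH_ID_PREFIX, SKETCH_ID_LEN, and format_id are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Length of a prepared-transaction identifier, NOT counting the trailing
 * NUL.  ASSUMED layout, for illustration only:
 *
 *   2   prefix "px"
 *   1   '_'
 *   8   first field, 8 hex digits
 *   1   '_'
 *   8   second field, 8 hex digits
 *   1   '_'
 *   8   third field, 8 hex digits
 */
#define SKETCH_ID_PREFIX "px"
#define SKETCH_ID_LEN (2 + 1 + 8 + 1 + 8 + 1 + 8)

/* Format an identifier; buf must have room for SKETCH_ID_LEN + 1 bytes. */
static int
format_id(char *buf, size_t buflen, unsigned int a, unsigned int b,
          unsigned int c)
{
    assert(strlen(SKETCH_ID_PREFIX) == 2);
    assert(buflen >= (size_t) SKETCH_ID_LEN + 1);

    return snprintf(buf, buflen, "%s_%08X_%08X_%08X",
                    SKETCH_ID_PREFIX, a, b, c);
}
```

Writing the terms out this way also makes it obvious whether the macro is meant to include the terminating NUL, which the bare arithmetic leaves open.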

+            ereport(WARNING, (errmsg("could not resolve dangling foreign transaction for xid %u, foreign server %u and user %d",
+                                     fdwxact->local_xid, fdwxact->serverid, fdwxact->userid)));

Formatting is wrong.

My ability to concentrate on this patch is just about exhausted for
today so I think I'm going to have to stop here. But in general I
would say this patch still needs a lot of work. As noted above, the
concurrency, crash-safety, and error-handling issues don't seem to have
been thought through carefully enough, and there are even a fairly
large number of trivial spelling errors and coding and/or message
style violations. Comments are lacking in some places where they are
clearly needed. There seems to be a fair amount of work needed to
ensure that each thing has exactly one name: not two, and not a shared
name with something else, and that all of those names are clear.
There are a few TODO items remaining in the code. I think that it is
going to take a significant effort to get all of this cleaned up.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#180

Masahiko Sawada

sawada.mshk@gmail.com

almost 8 years ago

In reply to: Robert Haas (#179)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Sat, Feb 10, 2018 at 4:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Feb 8, 2018 at 3:58 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Overall, what's the status of this patch? Are we hung up on this
issue only, or are there other things?

AFAIK there are no remaining technical issues in this patch other than
this one. The patch has tests and docs, and includes everything needed
to support atomic commit of distributed transactions: it introduces the
atomic commit ability for distributed transactions along with the
corresponding FDW APIs, and makes postgres_fdw support 2PC. I think
this patch needs to be reviewed, especially the foreign transaction
resolution functionality, which was redesigned earlier.

OK. I'm going to give 0002 a read-through now, but I think it would
be a good thing if others also contributed to the review effort.
There is a lot of code here, and there are a lot of other patches
competing for attention. That said, here we go:

Thank you for reviewing. I'll thoroughly update the whole patch
based on your comments and suggestions. Here are my answers, questions,
and current understanding.

The fdw-transactions section of the documentation seems to imply that
henceforth every FDW must call FdwXactRegisterForeignServer, which I
think is an unacceptable API break.

It doesn't seem advisable to make this behavior depend on
max_prepared_foreign_transactions. I think that it should be a
server-level option: use 2PC for this server, or not? FDWs that don't
have 2PC default to "no"; but others can be set to "yes" if the user
wishes. But we shouldn't force that decision to be on a cluster-wide
basis.

Since I've added a new option, two_phase_commit, to postgres_fdw, we need
to ask the FDW whether the foreign server is 2PC-capable in
order to register the foreign server information. That's why the patch
asked the FDW to call FdwXactRegisterForeignServer. However, the executor
can register the foreign server information automatically (e.g.
at BeginDirectModify and at BeginForeignModify) if the foreign server
itself has that information. We can add a two_phase_commit_enabled
column to the pg_foreign_server system catalog, and that column is set to
true if the foreign server is 2PC-capable (i.e. has the required functions)
and the user wants to use it.

But that brings up another issue: why is MyFdwConnections named that
way and why does it have those semantics? That is, why do we really
need a list of every FDW connection? I think we really only need the
ones that are 2PC-capable writes. If a non-2PC-capable foreign server
is involved in the transaction, then we don't really need to keep it in a
list. We just need to flag the transaction as not locally prepareable
i.e. clear TwoPhaseReady. I think that this could all be figured out
in a much less onerous way: if we ever perform a modification of a
foreign table, have nodeModifyTable.c either mark the transaction
non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
server is not 2PC capable, or otherwise add the appropriate
information to MyFdwConnections, which can then be renamed to indicate
that it contains only information about preparable stuff. Then you
don't need each individual FDW to be responsible about calling some
new function; the core code just does the right thing automatically.

I did not quite understand this comment. Did you mean that a foreign
transaction on a non-2PC-capable foreign server should be ended in the
same way as before (i.e. by XactCallback)?

Currently, because there is no FDW API to end a foreign transaction,
almost all FDWs use XactCallbacks to end the transaction. But once the
new FDW APIs are introduced, I think it's better to use them to end
transactions rather than using XactCallbacks. Otherwise we end up
having FDW APIs for 2PC (prepare and resolve) and XactCallback for
ending the transaction, which would be hard to understand. So I've
changed the foreign transaction management so that core code
explicitly asks the FDW to end/prepare a foreign transaction instead of
leaving that to individual FDWs. With the new FDW APIs, core code
has the information about all foreign servers involved in the
transaction and can call each API at the appropriate time.

+        if (!fdw_conn->end_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+                                        fdw_conn->umid, true))
+            elog(WARNING, "could not commit transaction on server %s",
+                 fdw_conn->servername);
First, as I noted above, this assumes that the local transaction can't
fail after this point, which is certainly false -- if nothing else,
think about a plug pull. Second, it assumes that the right thing to
do would be to throw a WARNING, which I think is false: if this is
running in the pre-commit phase, it's not too late to switch to the
abort path, and certainly if we make changes on only 1 server and the
commit on that server fails, we should be rolling back, not committing
with a warning. Third, if we did need to restrict ourselves to
warnings here, that's probably impractical. This function needs to do
network I/O, which is not a no-fail operation, and even if it works,
the remote side can fail in any arbitrary way.
+        /*
+         * Between FdwXactRegisterFdwXact call above till this backend hears back
+         * from foreign server, the backend may abort the local transaction
+         * (say, because of a signal). During abort processing, it will send
+         * an ABORT message to the foreign server. If the foreign server has
+         * not prepared the transaction, the message will succeed. If the
+         * foreign server has prepared transaction, it will throw an error,
+         * which we will ignore and the prepared foreign transaction will be
+         * resolved by a foreign transaction resolver.
+         */
+        if (!fdw_conn->prepare_foreign_xact(fdw_conn->serverid, fdw_conn->userid,
+                                            fdw_conn->umid, fdw_xact_id))
+        {
Again, I think this is an impractical API contract. It assumes that
prepare_foreign_xact can't throw errors, which is likely to make for a
lot of implementation problems on the FDW side -- you can't even
palloc. You can't call any existing code you've already got that
might throw an error for any reason. You definitely can't do a
syscache lookup. You can't accept interrupts, even though you're
doing network I/O that could hang. The only reason to structure it
this way is to avoid having a transaction that we think is prepared in
our local bookkeeping when, on the remote side, it really isn't. But
that is an unavoidable problem, because the whole system could crash
after the remote prepare has been done and before prepare_foreign_xact
even returns control. Or we could fail after XLogInsert() and before
XLogFlush().

AtEOXact_FdwXacts has the same problem.

I think you can make this a lot cleaner if you make it part of the
API contract that resolver shouldn't fail when asked to roll back a
transaction that was never prepared. Then it can work like this: just
before we prepare the remote transaction, we note the information in
shared memory and in the WAL. If that fails, we just abort. When the
resolver sees that our xact is no longer running and did not commit,
it concludes that we must have failed and tries to roll back all the
remotely-prepared xacts. If they don't exist, then the PREPARE
failed; if they do, then we roll 'em back. On the other hand, if all
of the prepares succeed, then we commit. Seeing that our XID is no
longer running and did commit, the resolver tries to commit those
remotely-prepared xacts. In the commit case, we log a complaint if we
don't find the xact, but in the abort case, it's got to be an expected
outcome. If we really wanted to get paranoid about this, we could log
a WAL record after finishing all the prepares -- or even each prepare
-- saying "hey, after this point the remote xact should definitely
exist!". And then the resolver could complain about a nonexistent remote
xact when rolling back only if that record hasn't been issued yet.
But to me that looks like overkill; the window between issuing that
record and issuing the actual commit record would be very narrow, and
in most cases they would be flushed together anyway. We could of
course force the new record to be separately flushed first, but that's
just making everything slower for no gain.

In FdwXactResolveForeignTransaction(), the resolver decides the fate of
the transaction from the status of the fdwxact entry and the state of the
local transaction in clog. What I need to do is make that function
log a complaint in the commit case if it can't find the prepared
transaction, and not do so in the abort case. Also, postgres_fdw doesn't
raise an error even if we cannot find the prepared transaction on the
foreign server, because it might have been resolved by another process.
But this is currently the FDW's responsibility. I should move it to the
resolver side. That is, the FDW can raise an error in the ordinary way,
but core code should catch and process it.

Note that this requires that we not truncate away portions of clog
that contain commit status information about no-longer-running
transactions that have unresolved FDW 2PC xacts, or at least that we
issue a WAL record updating the state of the fdw_xact so that it
doesn't refer to that portion of clog any more - e.g. by setting the
XID to either FrozenTransactionId or InvalidTransactionId, though that
would be a problem since it would break the unique-XID assumption. I
don't see the patch doing either of those things right now, although I
may have missed it. Note that here again the differences between
local 2PC and FDW 2PC really make a difference. Local 2PC has a
PGPROC+PGXACT, so the regular snapshot-taking code suffices to prevent
clog truncation, because the relevant XIDs are showing up in
snapshots. The PrescanPreparedTransactions stuff only needs to nail
things down until we reach consistency, and then the regular
mechanisms take over. We have no such easy out for this system.

You're right. Perhaps we can deal with it by using PrescanFdwXacts until
we reach a consistent point, and then having vac_update_datfrozenxid check
the local xids of unresolved fdwxacts to determine the new datfrozenxid.
Since the local xids of unresolved fdwxacts are not relevant
to vacuuming, we don't need to include them in snapshots,
GetOldestXmin, etc. We could also emit a hint to resolve fdwxacts when
nearing wraparound.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#181

Robert Haas

robertmhaas@gmail.com

almost 8 years ago

In reply to: Masahiko Sawada (#180)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Tue, Feb 13, 2018 at 5:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The fdw-transactions section of the documentation seems to imply that
henceforth every FDW must call FdwXactRegisterForeignServer, which I
think is an unacceptable API break.

It doesn't seem advisable to make this behavior depend on
max_prepared_foreign_transactions. I think that it should be a
server-level option: use 2PC for this server, or not? FDWs that don't
have 2PC default to "no"; but others can be set to "yes" if the user
wishes. But we shouldn't force that decision to be on a cluster-wide
basis.

Since I've added a new option, two_phase_commit, to postgres_fdw, we need
to ask the FDW whether the foreign server is 2PC-capable in
order to register the foreign server information. That's why the patch
asked the FDW to call FdwXactRegisterForeignServer. However, the executor
can register the foreign server information automatically (e.g.
at BeginDirectModify and at BeginForeignModify) if the foreign server
itself has that information. We can add a two_phase_commit_enabled
column to the pg_foreign_server system catalog, and that column is set to
true if the foreign server is 2PC-capable (i.e. has the required functions)
and the user wants to use it.

I don't see why this would need a new catalog column.

But that brings up another issue: why is MyFdwConnections named that
way and why does it have those semantics? That is, why do we really
need a list of every FDW connection? I think we really only need the
ones that are 2PC-capable writes. If a non-2PC-capable foreign server
is involved in the transaction, then we don't really need to keep it in a
list. We just need to flag the transaction as not locally prepareable
i.e. clear TwoPhaseReady. I think that this could all be figured out
in a much less onerous way: if we ever perform a modification of a
foreign table, have nodeModifyTable.c either mark the transaction
non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
server is not 2PC capable, or otherwise add the appropriate
information to MyFdwConnections, which can then be renamed to indicate
that it contains only information about preparable stuff. Then you
don't need each individual FDW to be responsible about calling some
new function; the core code just does the right thing automatically.

I did not quite understand this comment. Did you mean that a foreign
transaction on a non-2PC-capable foreign server should be ended in the
same way as before (i.e. by XactCallback)?

Currently, because there is no FDW API to end a foreign transaction,
almost all FDWs use XactCallbacks to end the transaction. But once the
new FDW APIs are introduced, I think it's better to use them to end
transactions rather than using XactCallbacks. Otherwise we end up
having FDW APIs for 2PC (prepare and resolve) and XactCallback for
ending the transaction, which would be hard to understand. So I've
changed the foreign transaction management so that core code
explicitly asks the FDW to end/prepare a foreign transaction instead of
leaving that to individual FDWs. With the new FDW APIs, core code
has the information about all foreign servers involved in the
transaction and can call each API at the appropriate time.

Well, it's one thing to introduce a new API. It's another thing to
require existing FDWs to be updated to use it. There are a lot of
existing FDWs out there, and I think that it is needlessly unfriendly
to force them all to be updated for v11 (or whenever this gets
committed) even if we think the new API is clearly better. FDWs that
work today should continue working after this patch is committed.

Separately, I think there's a question of whether the new API is in
fact better -- I'm not sure I have a completely well-formed opinion
about that yet.

In FdwXactResolveForeignTransaction(), the resolver decides the fate of
the transaction from the status of the fdwxact entry and the state of the
local transaction in clog. What I need to do is make that function
log a complaint in the commit case if it can't find the prepared
transaction, and not do so in the abort case.

+1.

Also, postgres_fdw doesn't
raise an error even if we cannot find the prepared transaction on the
foreign server, because it might have been resolved by another process.

+1.

But this is currently the FDW's responsibility. I should move it to the
resolver side. That is, the FDW can raise an error in the ordinary way,
but core code should catch and process it.

I don't understand exactly what you mean here.

You're right. Perhaps we can deal with it by using PrescanFdwXacts until
we reach a consistent point, and then having vac_update_datfrozenxid check
the local xids of unresolved fdwxacts to determine the new datfrozenxid.
Since the local xids of unresolved fdwxacts are not relevant
to vacuuming, we don't need to include them in snapshots,
GetOldestXmin, etc. We could also emit a hint to resolve fdwxacts when
nearing wraparound.

I agree with you about snapshots, but I'm not sure about GetOldestXmin.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#182

Masahiko Sawada

sawada.mshk@gmail.com

almost 8 years ago

In reply to: Robert Haas (#181)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Wed, Feb 21, 2018 at 6:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 13, 2018 at 5:42 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

The fdw-transactions section of the documentation seems to imply that
henceforth every FDW must call FdwXactRegisterForeignServer, which I
think is an unacceptable API break.

It doesn't seem advisable to make this behavior depend on
max_prepared_foreign_transactions. I think that it should be a
server-level option: use 2PC for this server, or not? FDWs that don't
have 2PC default to "no"; but others can be set to "yes" if the user
wishes. But we shouldn't force that decision to be on a cluster-wide
basis.

Since I've added a new option, two_phase_commit, to postgres_fdw, we need
to ask the FDW whether the foreign server is 2PC-capable in
order to register the foreign server information. That's why the patch
asked the FDW to call FdwXactRegisterForeignServer. However, the executor
can register the foreign server information automatically (e.g.
at BeginDirectModify and at BeginForeignModify) if the foreign server
itself has that information. We can add a two_phase_commit_enabled
column to the pg_foreign_server system catalog, and that column is set to
true if the foreign server is 2PC-capable (i.e. has the required functions)
and the user wants to use it.

I don't see why this would need a new catalog column.

I might be missing your point. As for breaking the API, this patch doesn't
break any existing FDWs. All the new APIs I proposed are dedicated to 2PC.
In other words, FDWs that work today can continue working after this
patch gets committed, but FDWs that want to support atomic commit
must be updated to use the new APIs. Calling
FdwXactRegisterForeignServer is necessary because the core code
tracks the foreign servers involved in the transaction, but
whether each foreign server uses the 2PC option (two_phase_commit) is
known only to the FDW code. We can eliminate the need to call
FdwXactRegisterForeignServer by moving the 2PC option from the FDW level
to the server level, so that FDWs are not forced to call the registering
function. If we did that, the user could use the 2PC option as a
server-level option.

But that brings up another issue: why is MyFdwConnections named that
way and why does it have those semantics? That is, why do we really
need a list of every FDW connection? I think we really only need the
ones that are 2PC-capable writes. If a non-2PC-capable foreign server
is involved in the transaction, then we don't really need to keep it in a
list. We just need to flag the transaction as not locally prepareable
i.e. clear TwoPhaseReady. I think that this could all be figured out
in a much less onerous way: if we ever perform a modification of a
foreign table, have nodeModifyTable.c either mark the transaction
non-preparable by setting XACT_FLAGS_FDWNOPREPARE if the foreign
server is not 2PC capable, or otherwise add the appropriate
information to MyFdwConnections, which can then be renamed to indicate
that it contains only information about preparable stuff. Then you
don't need each individual FDW to be responsible about calling some
new function; the core code just does the right thing automatically.
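
To make the suggestion above concrete, here is a minimal sketch in plain C. It is not PostgreSQL code: XactSketch, note_foreign_modification and the value given to XACT_FLAGS_FDWNOPREPARE are assumptions made for illustration. Only the flag name and the idea of a list restricted to 2PC-capable writes come from the mail. The point is that a single call made at modify time either marks the transaction non-preparable or records the server.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PARTICIPANTS 16                 /* arbitrary limit for this sketch */
#define XACT_FLAGS_FDWNOPREPARE (1U << 0)   /* name from the mail; value assumed */

typedef unsigned int Oid;

/* Hypothetical per-transaction state; not a PostgreSQL structure. */
typedef struct XactSketch
{
    unsigned int flags;                         /* transaction-level flags */
    Oid          preparable[MAX_PARTICIPANTS];  /* 2PC-capable servers written to */
    size_t       npreparable;
} XactSketch;

/* Would be called by the executor whenever a foreign table is modified. */
static void
note_foreign_modification(XactSketch *xact, Oid serverid, bool two_phase_capable)
{
    assert(xact != NULL);

    if (!two_phase_capable)
    {
        /* No list entry needed: just remember that PREPARE is impossible. */
        xact->flags |= XACT_FLAGS_FDWNOPREPARE;
        return;
    }

    /* Record each 2PC-capable server only once. */
    for (size_t i = 0; i < xact->npreparable; i++)
        if (xact->preparable[i] == serverid)
            return;

    assert(xact->npreparable < MAX_PARTICIPANTS);
    xact->preparable[xact->npreparable++] = serverid;
}

/* True while no non-2PC-capable server has been written to. */
static bool
xact_is_locally_preparable(const XactSketch *xact)
{
    return (xact->flags & XACT_FLAGS_FDWNOPREPARE) == 0;
}
```

With this shape, an FDW that knows nothing about 2PC never has to call a registration function; the caller only needs to know whether the server is 2PC-capable.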

I could not follow this comment. Did you mean that a foreign
transaction on a non-2PC-capable foreign server should be ended in the
same way as before (i.e. by XactCallback)?

Currently, because there is no FDW API to end a foreign transaction,
almost all FDWs use XactCallbacks to end the transaction. But once the
new FDW APIs are introduced, I think it's better to use them to end
transactions rather than XactCallbacks. Otherwise we end up
having FDW APIs for 2PC (prepare and resolve) and XactCallback for
ending the transaction, which would be hard to understand. So I've
changed the foreign transaction management so that the core code
explicitly asks the FDW to end/prepare a foreign transaction instead of
each FDW ending it individually. With the new FDW APIs, the core code
has the information about all foreign servers involved in the
transaction and can call each API at the appropriate time.

Well, it's one thing to introduce a new API. It's another thing to
require existing FDWs to be updated to use it. There are a lot of
existing FDWs out there, and I think that it is needlessly unfriendly
to force them all to be updated for v11 (or whenever this gets
committed) even if we think the new API is clearly better. FDWs that
work today should continue working after this patch is committed.

Agreed.

Separately, I think there's a question of whether the new API is in
fact better -- I'm not sure I have a completely well-formed opinion
about that yet.

I think one API should do one job. In terms of keeping the API simple, the
current four APIs are not bad. AFAICS other DB servers that
support 2PC, such as MySQL and Oracle, can be served by this API.
I'm thinking of removing the user mapping id from the arguments of three APIs
(prepare, resolve and end), because the user mapping id can be looked up
from {serverid, userid}. We could also make the get-prepare-id API
optional. That is, if an FDW doesn't define this API, the core code always
passes px_<random>_<serverid>_<userid> by default. An FDW for a foreign
server with a short limit on prepare-id length needs to provide the
API.
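
As an illustration of the default identifier format mentioned above, the following plain C sketch builds a px_<random>_<serverid>_<userid> string. The function name default_prepare_id and its signature are hypothetical, and the random part is passed in by the caller so that the example stays deterministic. It also shows why an FDW with a short identifier limit would want its own callback: the helper reports failure when the result does not fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef unsigned int Oid;

/*
 * Format a default prepared-transaction identifier into buf.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if the identifier does not fit into buflen bytes.
 */
static int
default_prepare_id(char *buf, size_t buflen,
                   unsigned long random_part, Oid serverid, Oid userid)
{
    int n;

    assert(buf != NULL && buflen > 0);

    n = snprintf(buf, buflen, "px_%lu_%u_%u", random_part, serverid, userid);
    if (n < 0 || (size_t) n >= buflen)
        return -1;              /* truncated: caller must use another scheme */
    return n;
}
```

For example, calling it with random part 12345, server OID 16384 and user OID 10 produces the string px_12345_16384_10.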

In FdwXactResolveForeignTransaction(), the resolver decides the fate of the
transaction from the status of the fdwxact entry and the state of the
local transaction in clog. What I need to do is make that function
log a complaint in the commit case if it can't find the prepared
transaction, and not do so in the abort case.

+1.

Also, postgres_fdw doesn't
raise an error even if we cannot find the prepared transaction on the
foreign server, because it might have been resolved by another process.

+1.

But this is currently the FDW's responsibility. I should move it to the
resolver side. That is, the FDW can raise an error in the ordinary way, but
the core code should catch and process it.

I don't understand exactly what you mean here.

Hmm, I think I got confused. My understanding is that logging a
complaint in the commit case, and not in the abort case, when the prepared
transaction doesn't exist is the core code's job. The API contract here
is that the FDW raises an error with the ERRCODE_UNDEFINED_OBJECT error code
if there is no such transaction. Since that is an expected outcome in the
abort case for the fdwxact manager, the core code can treat the error as
not an actual problem. So FDWs can basically raise an error if
resolution fails for whatever reason. But postgres_fdw doesn't
raise an error when the prepared transaction doesn't exist,
because in PostgreSQL a prepared transaction can be ended by another
process. The pseudo-code of the resolution part is as follows.

---
// Core code of foreign transaction resolution
PG_TRY();
{
    call API to resolve foreign transaction
}
PG_CATCH();
{
    if (sqlstate is ERRCODE_UNDEFINED_OBJECT)
    {
        if (we are committing)
            raise ERROR
        else // we are aborting
            raise WARNING // this is an expected result
    }
    else
        raise ERROR // re-throw the error
}
PG_END_TRY();

// postgres_fdw code of prepared transaction resolution
do "COMMIT/ROLLBACK PREPARED"
if (failed to resolve)
{
    if (sqlstate is ERRCODE_UNDEFINED_OBJECT)
    {
        raise WARNING
        return true; // treated as success
    }
    else
        raise ERROR // failed to resolve
}
return true;
---

Or do you mean that FDWs should not raise an error if there is no such
prepared transaction, so that the core code doesn't need to check the
sqlstate in case of error?

You're right. Perhaps we can deal with it by running PrescanFdwXacts until
we reach a consistent point, and then having vac_update_datfrozenxid check
the local xids of unresolved fdwxacts to determine the new datfrozenxid.
Since the local xids of unresolved fdwxacts would not be relevant
to vacuuming, we don't need to include them in snapshots,
GetOldestXmin, etc. We would also emit a hint to resolve fdwxacts when
nearing wraparound.

I agree with you about snapshots, but I'm not sure about GetOldestXmin.

Hmm, although I've thought about what could go wrong if we don't consider
the local xids of unresolved fdwxacts in GetOldestXmin, I could not find a
problem. Could you share your concern if you have one? I'll try to find a
possible problem based on it.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#183

David Steele

david@pgmasters.net

almost 8 years ago

In reply to: Masahiko Sawada (#182)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On 2/27/18 2:21 AM, Masahiko Sawada wrote:

Hmm, although I've thought concern in case where we don't consider
local xids of un-resolved fdwxact in GetOldestXmin, I could not find
problem. Could you share your concern if you have? I'll try to find a
possibility based on it.

It appears that this entry should be marked Waiting on Author so I have
done that.

I also think it may be time to move this patch to the next CF.

Regards,
--
-David
david@pgmasters.net

#184

Masahiko Sawada

sawada.mshk@gmail.com

almost 8 years ago

In reply to: David Steele (#183)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Thu, Mar 29, 2018 at 2:27 AM, David Steele <david@pgmasters.net> wrote:

On 2/27/18 2:21 AM, Masahiko Sawada wrote:

Hmm, although I've thought concern in case where we don't consider
local xids of un-resolved fdwxact in GetOldestXmin, I could not find
problem. Could you share your concern if you have? I'll try to find a
possibility based on it.

It appears that this entry should be marked Waiting on Author so I have
done that.

I also think it may be time to move this patch to the next CF.

I agree to move this patch to the next CF.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#185

Robert Haas

robertmhaas@gmail.com

over 7 years ago

In reply to: Masahiko Sawada (#182)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I might be missing your point. As for API breaking, this patch doesn't
break any existing FDWs. All new APIs I proposed are dedicated to 2PC.
In other words, FDWs that work today can continue working after this
patch gets committed, but if FDWs want to support atomic commit then
they should be updated to use new APIs. The reason why the calling of
FdwXactRegisterForeignServer is necessary is that the core code
controls the foreign servers that involved with the transaction but
the whether each foreign server uses 2PC option (two_phase_commit) is
known only on FDW code. We can eliminate the necessity of calling
FdwXactRegisterForeignServer by moving 2PC option from fdw-level to
server-level in order not to enforce calling the registering function
on FDWs. If we did that, the user can use the 2PC option as a
server-level option.

Well, FdwXactRegisterForeignServer has a "bool two_phase_commit"
argument. If it only needs to be called by FDWs that support 2PC,
then that argument is unnecessary. If it needs to be called by all
FDWs, then it breaks existing FDWs that don't call it.

But this is now responsible by FDW. I should change it to resolver
side. That is, FDW can raise error in ordinarily way but core code
should catch and process it.

I don't understand exactly what you mean here.

Hmm I think I got confused. My understanding is that logging a
complaint in commit case and not doing that in abort case if prepared
transaction doesn't exist is a core code's job. An API contract here
is that FDW raise an error with ERRCODE_UNDEFINED_OBJECT error code if
there is no such transaction. Since it's an expected case in abort
case for the fdwxact manager, the core code can regard the error as
not actual problem.

In general, it's not safe to catch an error and continue unless you
protect the code that throws the error by a sub-transaction. That
means we shouldn't expect the FDW to throw an error when the prepared
transaction isn't found and then just have the core code ignore the
error. Instead the FDW should return normally and, if the core code
needs to know whether the prepared transaction was found, then the FDW
should indicate this through a return value, not an ERROR.
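
A minimal sketch of the return-value contract described above, in plain C rather than against the real FDW API. The enum and function names (ResolveResult, CoreAction, core_handle_resolve_result) are hypothetical. The FDW reports "not found" as an ordinary result, and the core code decides whether that deserves a complaint, so no error has to be caught and ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* What a (hypothetical) FDW resolve callback reports back to the core. */
typedef enum ResolveResult
{
    RESOLVE_OK,          /* prepared transaction was committed or rolled back */
    RESOLVE_NOT_FOUND    /* no such prepared transaction on the foreign server */
} ResolveResult;

/* What the core code does with that report. */
typedef enum CoreAction
{
    CORE_DONE,           /* nothing more to do */
    CORE_LOG_COMPLAINT   /* unexpected: log it, without throwing an ERROR */
} CoreAction;

/*
 * A missing prepared transaction is expected when aborting (it may never
 * have been prepared, or someone else rolled it back), but suspicious when
 * committing.
 */
static CoreAction
core_handle_resolve_result(ResolveResult result, bool committing)
{
    if (result == RESOLVE_OK)
        return CORE_DONE;

    assert(result == RESOLVE_NOT_FOUND);
    return committing ? CORE_LOG_COMPLAINT : CORE_DONE;
}
```

Genuine failures, such as a lost connection, could still be raised as errors in the usual way; only the expected "not found" outcome moves into the return value.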

Or do you mean that FDWs should not raise an error if there is the
prepared transaction, and then core code doesn't need to check
sqlstate in case of error?

Right. As noted above, that's unsafe, so we shouldn't do it.

You're right. Perhaps we can deal with it by PrescanFdwXacts until
reach consistent point, and then have vac_update_datfrozenxid check
local xids of un-resolved fdwxact to determine the new datfrozenxid.
Since the local xids of un-resolved fdwxacts would not be relevant
with vacuuming, we don't need to include it to snapshot and
GetOldestXmin etc. Also we hint to resolve fdwxact when near
wraparound.

I agree with you about snapshots, but I'm not sure about GetOldestXmin.

Hmm, although I've thought concern in case where we don't consider
local xids of un-resolved fdwxact in GetOldestXmin, I could not find
problem. Could you share your concern if you have? I'll try to find a
possibility based on it.

I don't remember exactly what I was thinking when I wrote that, but I
think the point is that GetOldestXmin() does a bunch of things other
than control the threshold for VACUUM, and we'd need to study them all
and look for problems. For example, it won't do for an XID to get
reused while it's still associated with an unresolved fdwxact. We
therefore certainly need such XIDs to hold back the cluster-wide
threshold for clog truncation in some manner -- and maybe that
involves GetOldestXmin(). Or maybe not. But anyway the point,
broadly considered, is that GetOldestXmin() is used in various ways,
and I don't know if we've thought through all of the consequences in
regard to this new feature.

Can I ask what your time frame is for updating this patch?
Considering how much work appears to remain, if you want to get this
committed to v12, it would be best to get started as early as
possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#186

Masahiko Sawada

sawada.mshk@gmail.com

over 7 years ago

In reply to: Robert Haas (#185)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

Thank you for the comment.

On Fri, May 11, 2018 at 3:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I might be missing your point. As for API breaking, this patch doesn't
break any existing FDWs. All new APIs I proposed are dedicated to 2PC.
In other words, FDWs that work today can continue working after this
patch gets committed, but if FDWs want to support atomic commit then
they should be updated to use new APIs. The reason why the calling of
FdwXactRegisterForeignServer is necessary is that the core code
controls the foreign servers that involved with the transaction but
the whether each foreign server uses 2PC option (two_phase_commit) is
known only on FDW code. We can eliminate the necessity of calling
FdwXactRegisterForeignServer by moving 2PC option from fdw-level to
server-level in order not to enforce calling the registering function
on FDWs. If we did that, the user can use the 2PC option as a
server-level option.

Well, FdwXactRegisterForeignServer has a "bool two_phase_commit"
argument. If it only needs to be called by FDWs that support 2PC,
then that argument is unnecessary. If it needs to be called by all
FDWs, then it breaks existing FDWs that don't call it.

I understand now. Since, for now, FdwXactRegisterForeignServer needs to
be called only by FDWs that support 2PC, I will remove the argument.

But this is now responsible by FDW. I should change it to resolver
side. That is, FDW can raise error in ordinarily way but core code
should catch and process it.

I don't understand exactly what you mean here.

Hmm I think I got confused. My understanding is that logging a
complaint in commit case and not doing that in abort case if prepared
transaction doesn't exist is a core code's job. An API contract here
is that FDW raise an error with ERRCODE_UNDEFINED_OBJECT error code if
there is no such transaction. Since it's an expected case in abort
case for the fdwxact manager, the core code can regard the error as
not actual problem.

In general, it's not safe to catch an error and continue unless you
protect the code that throws the error by a sub-transaction. That
means we shouldn't expect the FDW to throw an error when the prepared
transaction isn't found and then just have the core code ignore the
error. Instead the FDW should return normally and, if the core code
needs to know whether the prepared transaction was found, then the FDW
should indicate this through a return value, not an ERROR.

Or do you mean that FDWs should not raise an error if there is the
prepared transaction, and then core code doesn't need to check
sqlstate in case of error?

Right. As noted above, that's unsafe, so we shouldn't do it.

Thank you. I will rethink the API contract based on your suggestion.

You're right. Perhaps we can deal with it by PrescanFdwXacts until
reach consistent point, and then have vac_update_datfrozenxid check
local xids of un-resolved fdwxact to determine the new datfrozenxid.
Since the local xids of un-resolved fdwxacts would not be relevant
with vacuuming, we don't need to include it to snapshot and
GetOldestXmin etc. Also we hint to resolve fdwxact when near
wraparound.

I agree with you about snapshots, but I'm not sure about GetOldestXmin.

Hmm, although I've thought concern in case where we don't consider
local xids of un-resolved fdwxact in GetOldestXmin, I could not find
problem. Could you share your concern if you have? I'll try to find a
possibility based on it.

I don't remember exactly what I was thinking when I wrote that, but I
think the point is that GetOldestXmin() does a bunch of things other
than control the threshold for VACUUM, and we'd need to study them all
and look for problems. For example, it won't do for an XID to get
reused while it's still associated with an unresolved fdwxact. We
therefore certainly need such XIDs to hold back the cluster-wide
threshold for clog truncation in some manner -- and maybe that
involves GetOldestXmin(). Or maybe not. But anyway the point,
broadly considered, is that GetOldestXmin() is used in various ways,
and I don't know if we've thought through all of the consequences in
regard to this new feature.

Okay, I'll have GetOldestXmin() also consider the oldest local xid of
unresolved fdwxacts in the next version of the patch, to be safe,
while looking for more efficient ways.
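
A rough sketch of what "consider the oldest local xid of unresolved fdwxacts" could look like, written as standalone C. The helper names are hypothetical, and the comparison only imitates the modulo-2^32 style of PostgreSQL's TransactionIdPrecedes() for normal XIDs; special XIDs other than the invalid one are ignored here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Wraparound-aware "a is older than b" for normal XIDs. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Oldest local XID among unresolved fdwxact entries, or Invalid if none. */
static TransactionId
oldest_unresolved_fdwxact_xid(const TransactionId *xids, size_t n)
{
    TransactionId oldest = InvalidTransactionId;

    for (size_t i = 0; i < n; i++)
    {
        if (xids[i] == InvalidTransactionId)
            continue;           /* unused slot */
        if (oldest == InvalidTransactionId || xid_precedes(xids[i], oldest))
            oldest = xids[i];
    }
    return oldest;
}

/* Hold a horizon back so it never passes an unresolved fdwxact's XID. */
static TransactionId
clamp_horizon(TransactionId horizon, TransactionId oldest_fdwxact)
{
    if (oldest_fdwxact != InvalidTransactionId &&
        xid_precedes(oldest_fdwxact, horizon))
        return oldest_fdwxact;
    return horizon;
}
```

Whether the clamp belongs in GetOldestXmin() itself or only in the clog-truncation path is exactly the open question in this exchange; the sketch shows only the arithmetic.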

Can I ask what your time frame is for updating this patch?
Considering how much work appears to remain, if you want to get this
committed to v12, it would be best to get started as early as
possible.

I'll post an updated patch by PGCon at the latest, hopefully in the next week.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#187

Masahiko Sawada

sawada.mshk@gmail.com

over 7 years ago

In reply to: Masahiko Sawada (#186)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Fri, May 11, 2018 at 9:56 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Thank you for the comment.

On Fri, May 11, 2018 at 3:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I might be missing your point. As for API breaking, this patch doesn't
break any existing FDWs. All new APIs I proposed are dedicated to 2PC.
In other words, FDWs that work today can continue working after this
patch gets committed, but if FDWs want to support atomic commit then
they should be updated to use new APIs. The reason why the calling of
FdwXactRegisterForeignServer is necessary is that the core code
controls the foreign servers that involved with the transaction but
the whether each foreign server uses 2PC option (two_phase_commit) is
known only on FDW code. We can eliminate the necessity of calling
FdwXactRegisterForeignServer by moving 2PC option from fdw-level to
server-level in order not to enforce calling the registering function
on FDWs. If we did that, the user can use the 2PC option as a
server-level option.

Well, FdwXactRegisterForeignServer has a "bool two_phase_commit"
argument. If it only needs to be called by FDWs that support 2PC,
then that argument is unnecessary. If it needs to be called by all
FDWs, then it breaks existing FDWs that don't call it.

I understood now. For now since FdwXactRegisterForeignServer needs to
be called by only FDWs that support 2PC, I will remove the argument.

Regarding the API design, should we use 2PC for a distributed
transaction if it involves both two or more 2PC-capable foreign servers and
a non-2PC-capable foreign server? Or should we raise
an error? The non-2PC-capable server might be one that has
2PC functionality but has it disabled, or one that doesn't have it at all.

If we use 2PC, transaction atomicity is guaranteed only among the
2PC-capable servers, which might be just a subset of the participants. If we
don't, we use 1PC in such cases, so that whenever 2PC is used the
transaction is always atomic among all
participants of the transaction. The current patch takes the former approach
(it doesn't allow the PREPARE case, though), but I think we could also take
the latter, because it makes little sense to the user for a
transaction to commit atomically among only some of the participants.

Also, whichever way we take, I think it would be
better to manage not only 2PC transactions but also non-2PC transactions
in the core, and to add a two_phase_commit argument. I think we can do that
without breaking existing FDWs. Currently FDWs manage transactions
using XactCallback, but the new APIs being added also manage transactions.
I think it would be better if an FDW used one way or the other (XactCallback
or the new APIs) for transaction management rather than
a combination of both. Otherwise two pieces of transaction management code
will be required: one that manages foreign transactions using
XactCallback for non-2PC transactions, and one that manages them using the
new APIs for 2PC transactions. That would not be easy for FDW
developers. So what I imagine for the new API is that FDW developers who
use the new APIs can use both 2PC and non-2PC transactions, while those who
use XactCallback can use only non-2PC transactions.
Any thoughts?

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#188

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

over 7 years ago

In reply to: Masahiko Sawada (#187)

RE: [HACKERS] Transactions involving multiple postgres foreign servers

From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]

Regarding to API design, should we use 2PC for a distributed
transaction if both two or more 2PC-capable foreign servers and
2PC-non-capable foreign server are involved with it? Or should we end
up with an error? the 2PC-non-capable server might be either that has
2PC functionality but just disables it or that doesn't have it.

but I think we also could take
the latter way because it doesn't make sense for user even if the
transaction commit atomically among not all participants.

I'm for the latter. That is, a COMMIT or PREPARE TRANSACTION statement issued from an application reports an error. DBMSs, particularly relational DBMSs (and even more particularly Postgres?), place high value on data correctness. So I think transaction atomicity should be preserved, at least by default. If we preferred updatability and performance to data correctness, why wouldn't we change the default value of synchronous_commit to off in favor of performance? On the other hand, if we want to allow 1PC commit when not all FDWs support 2PC, we can add a new GUC parameter like "allow_nonatomic_commit = on", just as synchronous_commit and fsync trade off data correctness against performance.
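
The policy being argued for here can be written down as a small decision function. This is a sketch only: CommitPlan and choose_commit_plan are hypothetical names, allow_nonatomic_commit is the GUC proposed in this mail rather than an existing setting, and the function counts only the foreign servers written to, leaving local writes out of the picture.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum CommitPlan
{
    COMMIT_ONE_PHASE,    /* plain commit on each participant */
    COMMIT_TWO_PHASE,    /* prepare everywhere, then commit prepared */
    COMMIT_REJECT        /* report an error at COMMIT / PREPARE TRANSACTION */
} CommitPlan;

/*
 * n_capable / n_incapable: number of modified foreign servers that can or
 * cannot do 2PC. allow_nonatomic_commit: the proposed opt-out switch.
 */
static CommitPlan
choose_commit_plan(int n_capable, int n_incapable, bool allow_nonatomic_commit)
{
    assert(n_capable >= 0 && n_incapable >= 0);

    if (n_capable + n_incapable <= 1)
        return COMMIT_ONE_PHASE;        /* nothing to coordinate */

    if (n_incapable == 0)
        return COMMIT_TWO_PHASE;        /* atomic among all participants */

    /* Mixed case: atomicity cannot be guaranteed for everyone. */
    return allow_nonatomic_commit ? COMMIT_ONE_PHASE : COMMIT_REJECT;
}
```

Under this reading the default (allow_nonatomic_commit off) never silently commits a transaction that is atomic among only some of its participants.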

Also, regardless whether we take either way I think it would be
better to manage not only 2PC transaction but also non-2PC transaction
in the core and add two_phase_commit argument. I think we can use it
without breaking existing FDWs. Currently FDWs manage transactions
using XactCallback but new APIs being added also manage transactions.
I think it might be better if users use either way (using XactCallback
or using new APIs) for transaction management rather than use both
ways with combination. Otherwise two codes for transaction management
will be required: the code that manages foreign transactions using
XactCallback for non-2PC transactions and code that manages them using
new APIs for 2PC transactions. That would not be easy for FDW
developers. So what I imagined for new API is that if FDW developers
use new APIs they can use both 2PC and non-2PC transaction, but if
they use XactCallback they can use only non-2PC transaction.
Any thoughts?

If we add new functions, can't we just add functions whose names are straightforward like PrepareTransaction() and CommitTransaction()? FDWs without 2PC support return NULL for the function pointer of PrepareTransaction().
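
A small C sketch of the "NULL function pointer means no 2PC" convention suggested here. The struct FdwXactRoutineSketch and the toy callbacks are invented for illustration; only the member names PrepareTransaction and CommitTransaction are taken from the mail, and the real FdwRoutine layout is not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback table; not the real FdwRoutine. */
typedef struct FdwXactRoutineSketch
{
    bool (*PrepareTransaction) (const char *prepare_id);  /* NULL: no 2PC */
    bool (*CommitTransaction) (const char *prepare_id);
} FdwXactRoutineSketch;

/* The core can test for 2PC support without any registration call. */
static bool
fdw_supports_2pc(const FdwXactRoutineSketch *routine)
{
    assert(routine != NULL);
    return routine->PrepareTransaction != NULL;
}

/* Toy callbacks standing in for a real FDW's implementation. */
static bool
toy_prepare(const char *prepare_id)
{
    return prepare_id != NULL;
}

static bool
toy_commit(const char *prepare_id)
{
    (void) prepare_id;          /* unused in the toy version */
    return true;
}

static const FdwXactRoutineSketch with_2pc = {toy_prepare, toy_commit};
static const FdwXactRoutineSketch without_2pc = {NULL, toy_commit};
```

An existing FDW whose table simply lacks the prepare callback keeps working unchanged, which is the compatibility property asked for earlier in the thread.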

This is similar to XA: XA requires each RM to provide function pointers for xa_prepare() and xa_commit(). If we go this way, maybe we could leverage the artifact of postgres_fdw to create the XA library for C/C++. I mean we put transaction control functions in the XA library, and postgres_fdw also uses it. i.e.:

postgres_fdw.so -> libxa.so -> libpq.so
\-------------/

Regards
Takayuki Tsunakawa

#189

Ashutosh Bapat

ashutosh.bapat@enterprisedb.com

over 7 years ago

In reply to: Robert Haas (#185)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Fri, May 11, 2018 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 27, 2018 at 2:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I might be missing your point. As for API breaking, this patch doesn't
break any existing FDWs. All new APIs I proposed are dedicated to 2PC.
In other words, FDWs that work today can continue working after this
patch gets committed, but if FDWs want to support atomic commit then
they should be updated to use new APIs. The reason why the calling of
FdwXactRegisterForeignServer is necessary is that the core code
controls the foreign servers that involved with the transaction but
the whether each foreign server uses 2PC option (two_phase_commit) is
known only on FDW code. We can eliminate the necessity of calling
FdwXactRegisterForeignServer by moving 2PC option from fdw-level to
server-level in order not to enforce calling the registering function
on FDWs. If we did that, the user can use the 2PC option as a
server-level option.

Well, FdwXactRegisterForeignServer has a "bool two_phase_commit"
argument. If it only needs to be called by FDWs that support 2PC,
then that argument is unnecessary.

An FDW may support 2PC but not a foreign server created using that
FDW. Without that argument all the foreign servers created using a
given FDW will need to support 2PC which may not be possible.

If it needs to be called by all
FDWs, then it breaks existing FDWs that don't call it.

That's true. By default FDWs, which do not want to use this facility,
can just pass false without any need for further change.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

#190

Masahiko Sawada

sawada.mshk@gmail.com

over 7 years ago

In reply to: Tsunakawa, Takayuki (#188)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Mon, May 21, 2018 at 10:42 AM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]

Regarding to API design, should we use 2PC for a distributed
transaction if both two or more 2PC-capable foreign servers and
2PC-non-capable foreign server are involved with it? Or should we end
up with an error? the 2PC-non-capable server might be either that has
2PC functionality but just disables it or that doesn't have it.

but I think we also could take
the latter way because it doesn't make sense for user even if the
transaction commit atomically among not all participants.

I'm for the latter. That is, COMMIT or PREPARE TRANSACTION statement issued from an application reports an error.

I'm not sure that we should raise an error in such a case, but if
we want to, we can raise an error when the transaction tries to
modify a non-2PC-capable server after having modified a 2PC-capable server.

DBMS, particularly relational DBMS (, and even more particularly Postgres?) places high value on data correctness. So I think transaction atomicity should be preserved, at least by default. If we preferred updatability and performance to data correctness, why don't we change the default value of synchronous_commit to off in favor of performance? On the other hand, if we want to allow 1PC commit when not all FDWs support 2PC, we can add a new GUC parameter like "allow_nonatomic_commit = on", just like synchronous_commit and fsync trade-offs data correctness and performance.

Honestly, I'm not sure we should use atomic commit by default at this
point, because that would change the default behavior for
existing users who use FDWs without 2PC. But I think controlling global
transaction atomicity with a GUC parameter would be a good idea. For
example, synchronous_commit = 'global' could make backends wait for the
transaction to be resolved globally before returning to the user.

Also, regardless whether we take either way I think it would be
better to manage not only 2PC transaction but also non-2PC transaction
in the core and add two_phase_commit argument. I think we can use it
without breaking existing FDWs. Currently FDWs manage transactions
using XactCallback but new APIs being added also manage transactions.
I think it might be better if users use either way (using XactCallback
or using new APIs) for transaction management rather than use both
ways with combination. Otherwise two codes for transaction management
will be required: the code that manages foreign transactions using
XactCallback for non-2PC transactions and code that manages them using
new APIs for 2PC transactions. That would not be easy for FDW
developers. So what I imagined for new API is that if FDW developers
use new APIs they can use both 2PC and non-2PC transaction, but if
they use XactCallback they can use only non-2PC transaction.
Any thoughts?

If we add new functions, can't we just add functions whose names are straightforward like PrepareTransaction() and CommitTransaction()? FDWs without 2PC support returns NULL for the function pointer of PrepareTransaction().

This is similar to XA: XA requires each RM to provide function pointers for xa_prepare() and xa_commit(). If we go this way, maybe we could leverage the artifact of postgres_fdw to create the XA library for C/C++. I mean we put transaction control functions in the XA library, and postgres_fdw also uses it. i.e.:

postgres_fdw.so -> libxa.so -> libpq.so
\-------------/

I might not understand your comment correctly, but the current patch is
implemented in that way. The patch introduces new FDW APIs:
PrepareForeignTransaction, EndForeignTransaction,
ResolvePreparedForeignTransaction and GetPrepareId. The postgres core
calls each API at the appropriate time while managing each foreign
transaction. FDWs that don't support 2PC set those function pointers
to NULL.

Also, the current API design might not fit databases
other than PostgreSQL. For example, in MySQL we have to start an XA
transaction explicitly using XA START, whereas PostgreSQL can
prepare a transaction that was started by BEGIN TRANSACTION. So in
MySQL the global transaction id is required at the beginning of the XA
transaction. And XA END must be executed before we
prepare the transaction or commit it in one phase. So it would be better to
define the APIs according to X/Open XA in order to make them more general.
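
The difference can be seen by lining up the statements each server expects. The C sketch below only formats SQL text and sends nothing. RemoteKind, XactStep and remote_xact_command are hypothetical names, and the identifier is assumed to contain no single quote because the sketch does no escaping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum RemoteKind { REMOTE_POSTGRES, REMOTE_MYSQL } RemoteKind;
typedef enum XactStep
{
    STEP_START, STEP_END, STEP_PREPARE, STEP_COMMIT_PREPARED
} XactStep;

/*
 * Write the statement for one step into buf. Returns its length, 0 if the
 * server needs no statement for that step, or -1 if buf is too small.
 */
static int
remote_xact_command(RemoteKind kind, XactStep step, const char *gid,
                    char *buf, size_t buflen)
{
    int n = 0;

    assert(gid != NULL && buf != NULL && buflen > 0);
    assert(strchr(gid, '\'') == NULL);  /* sketch does not escape quotes */
    buf[0] = '\0';

    if (kind == REMOTE_MYSQL)
    {
        /* MySQL needs the global id from the very first statement. */
        switch (step)
        {
            case STEP_START:   n = snprintf(buf, buflen, "XA START '%s'", gid); break;
            case STEP_END:     n = snprintf(buf, buflen, "XA END '%s'", gid); break;
            case STEP_PREPARE: n = snprintf(buf, buflen, "XA PREPARE '%s'", gid); break;
            case STEP_COMMIT_PREPARED:
                               n = snprintf(buf, buflen, "XA COMMIT '%s'", gid); break;
        }
    }
    else
    {
        /* PostgreSQL names the transaction only at PREPARE time. */
        switch (step)
        {
            case STEP_START:   n = snprintf(buf, buflen, "BEGIN"); break;
            case STEP_END:     break;   /* no equivalent of XA END */
            case STEP_PREPARE: n = snprintf(buf, buflen, "PREPARE TRANSACTION '%s'", gid); break;
            case STEP_COMMIT_PREPARED:
                               n = snprintf(buf, buflen, "COMMIT PREPARED '%s'", gid); break;
        }
    }

    if (n < 0 || (size_t) n >= buflen)
        return -1;
    return n;
}
```

An API with only "prepare" and "end" hooks has no natural place for the XA START and XA END steps, which is the motivation given above for following X/Open XA more closely.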

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#191

Tsunakawa, Takayuki

tsunakawa.takay@jp.fujitsu.com

over 7 years ago

In reply to: Masahiko Sawada (#190)

RE: [HACKERS] Transactions involving multiple postgres foreign servers

From: Masahiko Sawada [mailto:sawada.mshk@gmail.com]

I'm for the latter. That is, COMMIT or PREPARE TRANSACTION statement
issued from an application reports an error.

I'm not sure that we should end up with an error in such case, but if
we want then we can raise an error when the transaction tries to
modify 2PC-non-capable server after modified 2PC-capable server.

Such early reporting would be better, but I wonder if we can handle the opposite order: update data on a 2PC-capable server after a 2PC-non-capable server. If it's not easy or efficient, I think it's enough to report the error at COMMIT and PREPARE TRANSACTION, just like we report "ERROR: cannot PREPARE a transaction that has operated on temporary tables" at PREPARE TRANSACTION.

Honestly I'm not sure we should use atomic commit by default at this
point. Because it also means to change default behavior though the
existing users use them without 2PC. But I think control of global
transaction atomicity by GUC parameter would be a good idea. For
example, synchronous_commit = 'global' makes backends wait for
transaction to be resolved globally before returning to the user.

Regarding the incompatibility of default behavior, Postgres has a history of pursuing correctness and less confusion, such as the handling of backslash characters in strings and the automatic type casts described below.

Non-character data types are no longer automatically cast to TEXT (Peter, Tom)
Previously, if a non-character value was supplied to an operator or function that requires text input, it was automatically cast to text, for most (though not all) built-in data types. This no longer happens: an explicit cast to text is now required for all non-character-string types. ... The reason for the change is that these automatic casts too often caused surprising behavior.

I might not understand your comment correctly, but the current patch is
implemented in such a way. The patch introduces new FDW APIs:
PrepareForeignTransaction, EndForeignTransaction,
ResolvePreparedForeignTransaction and GetPrepareId. The postgres core
calls each API at the appropriate time while managing each foreign
transaction. FDWs that don't support 2PC set these function pointers
to NULL.

Ouch, you are right.

Also, the current API design might not fit databases other than
PostgreSQL. For example, in MySQL we have to start an XA
transaction explicitly with XA START, whereas PostgreSQL can
prepare a transaction that was started by BEGIN TRANSACTION. So in
MySQL the global transaction id is required at the beginning of the XA
transaction. We also have to execute XA END before we
prepare it or commit it in one phase. So it would be better to define
the APIs according to X/Open XA in order to make them more general.

I thought of:

* Put the functions that implement xa_prepare(), xa_commit() and xa_rollback() in libxa.so or libpq.so.
* PrepareForeignTransaction and EndForeignTransaction for postgres_fdw call them.

I meant just code reuse for Postgres. But this is only my intuition, so never mind.

Regards
Takayuki Tsunakawa

#192

Robert Haas

robertmhaas@gmail.com

over 7 years ago

In reply to: Masahiko Sawada (#187)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Regarding API design, should we use 2PC for a distributed
transaction if both two or more 2PC-capable foreign servers and a
2PC-non-capable foreign server are involved in it? Or should we end
up with an error? The 2PC-non-capable server might be either one that has
2PC functionality but just disables it or one that doesn't have it.

It seems to me that this is functionality that many people will not
want to use. First, doing a PREPARE and then a COMMIT for each FDW
write transaction is bound to be more expensive than just doing a
COMMIT. Second, because the default value of
max_prepared_transactions is 0, this can only work at all if special
configuration has been done on the remote side. Because of the second
point in particular, it seems to me that the default for this new
feature must be "off". It would make no sense to ship a default configuration
of PostgreSQL that doesn't work with the default configuration of
postgres_fdw, and I do not think we want to change the default value
of max_prepared_transactions. It was changed from 5 to 0 a number of
years back for good reason.

So, I think the question could be broadened a bit: how you enable this
feature if you want it, and what happens if you want it but it's not
available for your choice of FDW? One possible enabling method is a
GUC (e.g. foreign_twophase_commit). It could be true/false, with true
meaning use PREPARE for all FDW writes and fail if that's not
supported, or it could be three-valued, like require/prefer/disable,
with require throwing an error if PREPARE support is not available and
prefer using PREPARE where available but without failing when it isn't
available. Another possibility could be to make it an FDW option,
possibly capable of being set at multiple levels (e.g. server or
foreign table). If any FDW involved in the transaction demands
distributed 2PC semantics then the whole transaction must have those
semantics or it fails. I was previously leaning toward the latter
approach, but I guess now the former approach is sounding better. I'm
not totally certain I know what's best here.
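
The three-valued option described above can be sketched in C as follows. This is only an illustration of the require/prefer/disable semantics, not code from any patch: the enum names and the function plan_foreign_commit() are hypothetical, and whether a server can prepare is simply passed in as a flag.

```c
#include <stdbool.h>

/* Hypothetical values for a three-valued foreign_twophase_commit setting. */
typedef enum
{
    FOREIGN_2PC_DISABLE,    /* never use PREPARE for FDW writes */
    FOREIGN_2PC_PREFER,     /* use PREPARE where available, don't fail otherwise */
    FOREIGN_2PC_REQUIRE     /* use PREPARE, fail if it is not available */
} Foreign2PCMode;

/* What to do with one foreign server at commit time. */
typedef enum
{
    COMMIT_ONE_PHASE,       /* plain COMMIT on the foreign server */
    COMMIT_TWO_PHASE,       /* PREPARE TRANSACTION, then COMMIT PREPARED */
    COMMIT_ERROR            /* raise an error and fail the transaction */
} ForeignCommitPlan;

/* Map the setting and the server's capability to a commit plan. */
ForeignCommitPlan
plan_foreign_commit(Foreign2PCMode mode, bool server_can_prepare)
{
    switch (mode)
    {
        case FOREIGN_2PC_DISABLE:
            return COMMIT_ONE_PHASE;
        case FOREIGN_2PC_PREFER:
            return server_can_prepare ? COMMIT_TWO_PHASE : COMMIT_ONE_PHASE;
        case FOREIGN_2PC_REQUIRE:
            return server_can_prepare ? COMMIT_TWO_PHASE : COMMIT_ERROR;
    }
    return COMMIT_ERROR;    /* not reached for valid enum values */
}
```

A true/false setting corresponds to keeping only the disable and require rows of this mapping.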

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#193

Masahiko Sawada

sawada.mshk@gmail.com

over 7 years ago

In reply to: Robert Haas (#192)

Re: [HACKERS] Transactions involving multiple postgres foreign servers

On Sat, May 26, 2018 at 12:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, May 18, 2018 at 11:21 AM, Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Regarding API design, should we use 2PC for a distributed
transaction if both two or more 2PC-capable foreign servers and a
2PC-non-capable foreign server are involved in it? Or should we end
up with an error? The 2PC-non-capable server might be either one that has
2PC functionality but just disables it or one that doesn't have it.

It seems to me that this is functionality that many people will not
want to use. First, doing a PREPARE and then a COMMIT for each FDW
write transaction is bound to be more expensive than just doing a
COMMIT. Second, because the default value of
max_prepared_transactions is 0, this can only work at all if special
configuration has been done on the remote side. Because of the second
point in particular, it seems to me that the default for this new
feature must be "off". It would make no sense to ship a default configuration
of PostgreSQL that doesn't work with the default configuration of
postgres_fdw, and I do not think we want to change the default value
of max_prepared_transactions. It was changed from 5 to 0 a number of
years back for good reason.

I'm not sure it's true that many people will not want to use this
feature, because it seems to me that many people don't want to
use a database that lacks transaction atomicity. But I agree
that this feature should not be enabled by default, just as 2PC is
disabled by default.

So, I think the question could be broadened a bit: how you enable this
feature if you want it, and what happens if you want it but it's not
available for your choice of FDW? One possible enabling method is a
GUC (e.g. foreign_twophase_commit). It could be true/false, with true
meaning use PREPARE for all FDW writes and fail if that's not
supported, or it could be three-valued, like require/prefer/disable,
with require throwing an error if PREPARE support is not available and
prefer using PREPARE where available but without failing when it isn't
available. Another possibility could be to make it an FDW option,
possibly capable of being set at multiple levels (e.g. server or
foreign table). If any FDW involved in the transaction demands
distributed 2PC semantics then the whole transaction must have those
semantics or it fails. I was previously leaning toward the latter
approach, but I guess now the former approach is sounding better. I'm
not totally certain I know what's best here.

I agree that the former is better. That way, we can also control the
parameter at the transaction level. If we allow the 'prefer' behavior we
need to manage not only 2PC-capable foreign servers but also
2PC-non-capable foreign servers. That requires all FDWs to call the
registration function. So I think a two-valued parameter would be
better.

BTW, sorry for the delay in submitting the updated patch. I'll post the
updated patch this week, but I'd like to share the new API design
beforehand.

APIs that I'd like to add are 4 functions and 1 registration function:
PrepareForeignTransaction, CommitForeignTransaction,
RollbackForeignTransaction, IsTwoPhaseCommitEnabled and
FdwXactRegisterForeignServer. All FDWs that want to support atomic
commit have to support all of these APIs and call the registration function
when a foreign transaction opens.
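
As a rough sketch of the "support all APIs" rule, the callbacks could be grouped in a struct of function pointers and checked before an FDW is treated as 2PC-capable. The callback names come from the list above, but the signatures, the struct FdwXactCallbacks, the stub_* functions and both helper functions are assumptions made for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical signatures; the message above names the callbacks only. */
typedef bool (*PrepareForeignTransaction_fn) (const char *fdwxact_id);
typedef bool (*CommitForeignTransaction_fn) (const char *fdwxact_id, bool prepared);
typedef bool (*RollbackForeignTransaction_fn) (const char *fdwxact_id, bool prepared);
typedef bool (*IsTwoPhaseCommitEnabled_fn) (void);

/* Hypothetical container for the four callbacks. */
typedef struct FdwXactCallbacks
{
    PrepareForeignTransaction_fn PrepareForeignTransaction;
    CommitForeignTransaction_fn CommitForeignTransaction;
    RollbackForeignTransaction_fn RollbackForeignTransaction;
    IsTwoPhaseCommitEnabled_fn IsTwoPhaseCommitEnabled;
} FdwXactCallbacks;

/* True only if the FDW provides every callback needed for atomic commit. */
bool
fdw_supports_atomic_commit(const FdwXactCallbacks *cb)
{
    return cb != NULL &&
        cb->PrepareForeignTransaction != NULL &&
        cb->CommitForeignTransaction != NULL &&
        cb->RollbackForeignTransaction != NULL &&
        cb->IsTwoPhaseCommitEnabled != NULL;
}

/* True if all callbacks exist and the FDW reports 2PC as enabled. */
bool
fdw_can_prepare(const FdwXactCallbacks *cb)
{
    return fdw_supports_atomic_commit(cb) && cb->IsTwoPhaseCommitEnabled();
}

/* Stub callbacks standing in for a real FDW implementation. */
bool stub_prepare(const char *id) { (void) id; return true; }
bool stub_commit(const char *id, bool prepared) { (void) id; (void) prepared; return true; }
bool stub_rollback(const char *id, bool prepared) { (void) id; (void) prepared; return true; }
bool stub_enabled(void) { return true; }
```

An FDW that leaves any of the pointers NULL would be handled as 2PC-non-capable under this sketch.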

The transaction processing sequence with atomic commit will be as follows.

1. FDW begins a transaction on a 2PC-capable foreign server.
2. FDW registers the foreign server with/without a foreign transaction
identifier by calling FdwXactRegisterForeignServer().
* The foreign transaction identifier passed in can be NULL. If it's
NULL, the core code constructs it like 'fx_<4 random
chars>_<serverid>_<userid>'.
* Providing the foreign transaction identifier at the beginning of the
transaction is useful because some DBMSs, such as Oracle Database or
MySQL, require a transaction identifier at the beginning of an XA
transaction.
* Registering the foreign transaction guarantees that the
transaction is controlled by the core and that the APIs are called at
an appropriate time.
3. Perform 1 and 2 whenever the distributed transaction opens a
transaction on 2PC-capable foreign servers.
* When the distributed transaction modifies a foreign server, we
mark it as 'modified'.
* This mark is used at commit to check if it's necessary to use 2PC.
* At the same time, we also check if the foreign server enables
2PC by calling IsTwoPhaseCommitEnabled().
* If an FDW disables 2PC or doesn't provide that function, we mark
XACT_FLAGS_FDWNONPREPARE. This is necessary because we need to
remember that we wrote to a 2PC-non-capable foreign server.
* When the distributed transaction modifies a temp table locally,
mark XACT_FLAGS_WROTENONTEMREL.
* This is also used at commit to check if it's necessary to use 2PC.
4. During pre-commit, we prepare all foreign transactions, if 2PC is
required, by calling PrepareForeignTransaction().
* If we don't need to use 2PC, we commit all foreign transactions
by calling CommitForeignTransaction() with 'prepared' == false.
* If the transaction raises an error during or before pre-commit for
whatever reason, we roll them back by calling
RollbackForeignTransaction(). In case of rollback, we could call
RollbackForeignTransaction() with 'prepared' == true but the
corresponding foreign transaction might not exist. This is an API
contract.
5. Local commit
6. Launch a foreign transaction resolver process and wait for it to
resolve all foreign transactions.
* The foreign transactions are resolved according to the status of
the local transaction by calling CommitForeignTransaction() or
RollbackForeignTransaction() with 'prepared' == true.
7. After resolving all foreign transactions, the resolver process wakes
up the waiting backend process.
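
The sequence above can be sketched as a small in-memory simulation in C. It is a sketch under several assumptions, not the patch's code: the ForeignXact struct, the FxState values and both functions are hypothetical, the decision whether 2PC is needed is passed in as need_2pc rather than derived from the flags, and the resolver process of steps 6 and 7 is reduced to a loop. Only the identifier format 'fx_<4 random chars>_<serverid>_<userid>' is taken from step 2.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { FX_OPEN, FX_PREPARED, FX_COMMITTED, FX_ROLLED_BACK } FxState;

/* Hypothetical bookkeeping for one registered foreign transaction. */
typedef struct
{
    unsigned    serverid;
    unsigned    userid;
    bool        prepare_ok;     /* simulates whether PREPARE succeeds */
    FxState     state;
    char        id[64];
} ForeignXact;

/*
 * Step 2: build the default identifier. The 4 characters are supplied by
 * the caller so that the sketch stays deterministic.
 */
void
make_fdwxact_id(ForeignXact *fx, const char *rand4)
{
    snprintf(fx->id, sizeof(fx->id), "fx_%.4s_%u_%u",
             rand4, fx->serverid, fx->userid);
}

/* Steps 4 to 7. Returns true if the distributed transaction committed. */
bool
commit_distributed(ForeignXact *fx, int n, bool need_2pc)
{
    int         i;

    if (!need_2pc)
    {
        /* Step 4 without 2PC: commit with 'prepared' == false. */
        for (i = 0; i < n; i++)
            fx[i].state = FX_COMMITTED;
        return true;
    }

    /* Step 4 with 2PC: prepare every foreign transaction at pre-commit. */
    for (i = 0; i < n; i++)
    {
        if (!fx[i].prepare_ok)
        {
            /* Error during pre-commit: roll back all of them. */
            for (int j = 0; j < n; j++)
                fx[j].state = FX_ROLLED_BACK;
            return false;
        }
        fx[i].state = FX_PREPARED;
    }

    /* Step 5: the local commit would happen here. */

    /* Steps 6 and 7: the resolver commits with 'prepared' == true. */
    for (i = 0; i < n; i++)
        fx[i].state = FX_COMMITTED;
    return true;
}
```

In this sketch a failed prepare on any server rolls back every registered foreign transaction, including ones that were never prepared, which mirrors the API contract noted in step 4.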

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center